Article

Modeling Popularity and Reliability of Sources in Multilingual Wikipedia

by Włodzimierz Lewoniewski *, Krzysztof Węcel and Witold Abramowicz
Department of Information Systems, Poznań University of Economics and Business, 61-875 Poznań, Poland
* Author to whom correspondence should be addressed.
Information 2020, 11(5), 263; https://doi.org/10.3390/info11050263
Submission received: 31 March 2020 / Revised: 5 May 2020 / Accepted: 7 May 2020 / Published: 13 May 2020
(This article belongs to the Special Issue Quality of Open Data)

Abstract

One of the most important factors affecting the quality of content in Wikipedia is the presence of reliable sources. By following references, readers can verify facts or find more details about the described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users; therefore, information about the same topic may be inconsistent. This also applies to the use of references in different language versions of a particular article, so the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We presented 10 models for assessing the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to a specific domain. Additionally, we analyzed the changes of popularity and reliability over time and identified growth leaders in each of the considered months. The results can be used for quality improvements of the content in different language versions of Wikipedia.

1. Introduction

Collaborative wiki services are becoming an increasingly popular source of knowledge in different countries. One of the most prominent examples of such free knowledge bases is Wikipedia. Nowadays this encyclopedia contains over 52 million articles in over 300 language versions [1]. Articles in each language version can be created and edited even by anonymous (not registered) users. Moreover, due to the relative independence of contributors in each language, we can often encounter differences between articles about the same topic in various language versions of Wikipedia.
One of the most important elements that significantly affect the quality of information in Wikipedia is the availability of a sufficient number of references to sources. Those references can confirm facts provided in the articles. Therefore, the community of Wikipedians (editors who write and edit articles) attaches great importance to the reliability of the sources. However, each language version can provide its own rules and criteria of reliability, as well as its own list of perennial sources whose use on Wikipedia is frequently discussed [2]. Moreover, these reliability criteria and lists of reliable sources can change over time.
According to English Wikipedia content guidelines, information in the encyclopedia articles should be based on reliable, published sources. The word “source” in this case can have three interpretations [2]: the piece of work (e.g., a book, article, research), the creator of the work (e.g., a scientist, writer, journalist), the publisher of the work (e.g., MDPI or Springer). The term “published” is often associated with text materials in printed format or online. Information in other formats (e.g., audio, video) can also be considered a reliable source if it was recorded or distributed by a reputable party.
The reliability of a source in Wikipedia articles depends on context. Academic and peer-reviewed publications as well as textbooks are usually the most reliable sources in Wikipedia. At the same time, not all scholarly materials meet the reliability criteria: some works may be outdated, may compete with other research in the field, or may even be controversial in light of other theories. Other popular sources of Wikipedia information are well-established press agencies. News reporting from such sources is generally considered to be reliable for statements of fact [2]. However, precautions are needed when citing breaking news, as it can contain serious inaccuracies.
Despite the fact that Wikipedia articles must present a neutral point of view, referenced sources are not required to be neutral, unbiased, or objective. However, websites whose content is largely user-generated are generally unacceptable. Such sites may include: personal or group blogs, content farms, forums, social media (e.g., Facebook, Reddit, Twitter), IMDb, most wikis (including Wikipedia) and others. Additionally, some sources can be deprecated or blacklisted on Wikipedia.
Given the fact that there are more than 1.5 billion websites on the World Wide Web [3], it is a challenging task to assess the reliability of all of them. Additionally, reliability is a subjective concept related to information quality [4,5,6], and each source can be assessed differently depending on the topic and the language community of Wikipedia. It should also be taken into account that the reputation of a newspaper or website can change over time, so periodic re-assessment may be necessary.
According to the English Wikipedia content guideline [2]: “in general, the more people engaged in checking facts, analyzing legal issues and scrutinizing the writing, the more reliable the publication.” Related work described in Section 2 showed that there is room for improving approaches to the assessment of sources based on publicly available Wikipedia data, using different measures of Wikipedia articles. Therefore, we decided to extract measures related to the demand for information and the quality of articles, and to use them to build 10 models for assessing the popularity and reliability of sources in different language versions and in various periods. The simplest model was based on frequency of occurrence, which is commonly used in other related works [7,8,9,10]. The other nine novel models used various combinations of measures related to the quality and popularity of Wikipedia articles. The models are described in Section 3.
In order to extract sources from references of Wikipedia articles in different languages, we designed and implemented our own algorithms in Python. In Section 4 we describe basic and complex methods of extracting references from Wikipedia articles. To the data extracted from references in each Wikipedia article we added different measures related to the popularity and quality of Wikipedia articles (such as pageviews, number of references, article length, number of authors) in order to assess sources. Based on the results we built rankings of the most popular and reliable sources in different language editions of Wikipedia. Additionally, we compare positions of selected sources in the reliability ranking in different language versions of Wikipedia in Section 5. We also assess the similarity of the rankings of the most reliable sources obtained by different models in Section 6.
We also designed our own algorithms that leverage data from semantic databases (Wikidata and DBpedia) to extract additional metadata about the sources and to unify and classify them in order to find the most reliable ones in specific domains. In Section 7 we show the results of analyzing sources based on some parameters from citation templates (such as “publisher” and “journal”), and separately we show the analysis of the topics of sources based on semantic databases.
Using different periods, we compare the results of the popularity and reliability assessment of the sources in Section 8. Comparing the obtained results, we were able to find the growth leaders described in Section 9. We also present the assessment of the effectiveness of different models in Section 10.1. Additionally, we provide information about the limitations of the study in Section 10.2.

2. Recent Work

Because source reliability is important for the quality assessment of Wikipedia articles, there is a wide range of works covering the analysis of references in this encyclopedia.
Some studies used reference counts in models for automatic quality assessment of Wikipedia articles. One of the first works in this direction used the reference count as a structural feature to predict the quality of Wikipedia articles [11,12]. Based on the references, users can assess the trustworthiness of Wikipedia articles; therefore, we consider the source of information an important factor [13].
References often contain an external link (URL) to the source page where the cited information is placed. Therefore, including the number of external links in Wikipedia articles in the models can also help to assess information quality [14,15].
In addition to the analysis of quantity, there are studies analyzing the qualitative characteristics and metadata related to references. One of the works used special identifiers (such as DOI, ISBN) to unify the references and find the similarity of sources between language versions of Wikipedia [8]. Another recent study analyzed engagement with citations in Wikipedia articles and found that references are consulted more commonly when readers cannot find enough information in the selected Wikipedia article [16]. There are also works which showed that many citations in Wikipedia articles refer to scientific publications [8,17], especially if they are open-access [18], and that Wikipedia authors prefer to cite recently published journal articles as sources [10]. Thus, Wikipedia is especially valuable due to the potential of direct linking to other primary sources. Other popular sources of information in Wikipedia are news websites, and there is a method for automatic suggestion of news sources for selected statements in articles [19].
Reference analysis can be important for the quality assessment of Wikipedia articles. At the same time, articles of higher quality must have more proven and reliable sources. Therefore, in order to assess the reliability of a specific source, we can analyze the Wikipedia articles in which the related references are placed.
The relevance of article length and number of references for the quality assessment of Wikipedia content was supported by many publications [15,20,21,22,23,24,25,26]. Particularly interesting is the combination of these indicators (e.g., the ratio of references to article length), as it can be more actionable in quality prediction than each of them separately [27].
The information quality of Wikipedia also depends on the authors who contributed to the article. High-quality articles are often jointly created by a large number of different Wikipedia users [28,29]. Therefore, we can use the number of unique authors as one of the measures of quality of Wikipedia articles [26,30,31]. Additionally, we can take into account information about the experience of Wikipedians [32].
One of the recent studies showed that after loading a page, 0.2% of the time the reader clicks on an external reference, 0.6% on an external link and 0.8% hovers over a reference [9]. Therefore, popularity can play an important role not only in estimating the quality of information in a specific language version of Wikipedia [33] but also in checking the reliability of the sources in it. A larger number of readers of a Wikipedia article may allow incorrect or outdated information to be changed more rapidly [26]. The popularity of an article can be measured based on the number of visits [34].
Taking into account different studies related to reference analysis and quality assessment of Wikipedia articles, we created 10 models for source assessment. Unlike other studies, we used more complex methods of extracting references and included more language versions of Wikipedia. Additionally, we used a semantic layer to identify the source type and metadata in order to create rankings of the sources in specific domains. We also took into account different periods to compare the reliability indicators of a source in various months and to find the growth leaders. Moreover, the models assess references based on publicly available data (Wikimedia Downloads [35]), so anybody can use our models for different purposes.

3. Popularity and Reliability Models of the Wikipedia Sources

In this Section we describe ten models related to popularity and reliability of the sources. In most cases source means domain (or subdomain) of the URL in references. Models are identified with abbreviations:
  • F model—based on frequency (F) of source usage.
  • P model—based on cumulative pageviews (P) of the article in which source appears.
  • PR model—based on cumulative pageviews (P) of the article in which source appears divided by number of the references (R) in this article.
  • PL model—based on cumulative pageviews (P) of the article in which source appears divided by article length (L).
  • Pm model—based on daily pageviews median (Pm) of the article in which source appears.
  • PmR model—based on daily pageviews median (Pm) of the article in which source appears divided by number of the references (R) in this article.
  • PmL model—based on daily pageviews median (Pm) of the article in which source appears divided by article length (L).
  • A model—based on number of authors (A) of the article in which source appears.
  • AR model—based on number of authors (A) of the article in which source appears divided by number of the references (R) in this article.
  • AL model—based on number of authors (A) of the article in which source appears divided by article length (L).
Frequency of source usage in the F model means how many references contain the analyzed domain in the URL. This method was commonly used in related works [7,8,9,10]. Here we take into account the total number of appearances of such references, i.e., if the same source is cited 3 times, we count the frequency as 3. Equation (1) shows the calculation for the F model.
$$F(s) = \sum_{i=1}^{n} C_s(i), \tag{1}$$
where $s$ is the source, $n$ is the number of considered Wikipedia articles, and $C_s(i)$ is the number of references using source $s$ (e.g., domain in URL) in article $i$.
Pageviews, i.e., the number of times a Wikipedia article was displayed, are correlated with its quality [33]. We can expect that articles read by many people are more likely to have verified and reliable sources of information. The more people read the article, the more people can notice an inappropriate source and the sooner one of the readers decides to make changes.
In addition to the frequency of the source, the P model includes the cumulative pageviews of the article in which this source appears. Therefore, a source that was mentioned in a reference in a popular article can have a bigger value than a source that was mentioned even in several less popular articles. Equation (2) presents the calculation of the measure using the P model.
$$P(s) = \sum_{i=1}^{n} C_s(i) \cdot V(i), \tag{2}$$
where $s$ is the source, $n$ is the number of considered Wikipedia articles, $C_s(i)$ is the number of references using source $s$ (e.g., domain in URL) in article $i$, and $V(i)$ is the cumulative pageviews value of article $i$.
The PR model uses cumulative pageviews divided by the total number of references in the considered article. Unlike the previous model, here we take into account the visibility of the references using the analyzed source. We assume that, in general, the more references there are in the article, the less visible a specific reference is. Equation (3) shows the calculation of the measure using the PR model.
$$PR(s) = \sum_{i=1}^{n} \frac{V(i)}{C(i)} \cdot C_s(i), \tag{3}$$
where $s$ is the source, $n$ is the number of considered Wikipedia articles, $C(i)$ is the total number of references in article $i$, $C_s(i)$ is the number of references using source $s$ (e.g., domain in URL) in article $i$, and $V(i)$ is the cumulative pageviews value of article $i$.
Another important aspect of the visibility of each reference is the length of the entire article. Therefore, we provide an additional PL model that operates on the principles described in Equation (4).
$$PL(s) = \sum_{i=1}^{n} \frac{V(i)}{T(i)} \cdot C_s(i), \tag{4}$$
where $s$ is the source, $n$ is the number of considered Wikipedia articles, $T(i)$ is the length of the source code (wiki text) of article $i$, $C_s(i)$ is the number of references using source $s$ (e.g., domain in URL) in article $i$, and $V(i)$ is the cumulative pageviews value of article $i$.
The popularity of an article can be measured in different ways. As proposed in [26], we decided to also measure pageviews as the daily pageviews median (Pm) of individual articles. Thereby we provided additional models Pm, PmR, PmL that are modified versions of models P, PR, PL, respectively. The modification consists in replacing cumulative pageviews with the daily pageviews median.
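The difference between the two pageview measures can be illustrated with a short sketch. The daily counts below are invented for illustration, and `daily_pageviews_median` is a hypothetical helper name rather than part of the authors' implementation. The cumulative value is dominated by a single-day spike, while the median stays close to a typical day.

```python
from statistics import median


def daily_pageviews_median(daily_views):
    """Return the median of daily pageview counts (0 for an empty period)."""
    return median(daily_views) if daily_views else 0


# Hypothetical daily pageviews of one article over a week, with one spike.
views = [120, 130, 125, 5000, 128, 122, 127]

cumulative = sum(views)              # value used by the P, PR and PL models
pm = daily_pageviews_median(views)   # value used by the Pm, PmR and PmL models

print(cumulative, pm)
```

Under this reading, the Pm, PmR and PmL scores would be obtained by substituting the median for the cumulative value in Equations (2)–(4).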
As the pageviews value of article is more related to readers, we also propose a measure addressing the popularity among authors, i.e., number of users who decided to add content or make changes in the article. Given the assumptions of previous models we propose analogous models related to authors: models A, AR, AL are described in Equations (5)–(7), respectively.
$$A(s) = \sum_{i=1}^{n} C_s(i) \cdot E(i), \tag{5}$$
where $s$ is the source, $n$ is the number of considered Wikipedia articles, $C_s(i)$ is the number of references using source $s$ (e.g., domain in URL) in article $i$, and $E(i)$ is the total number of authors of article $i$.
$$AR(s) = \sum_{i=1}^{n} \frac{E(i)}{C(i)} \cdot C_s(i), \tag{6}$$
where $s$ is the source, $n$ is the number of considered Wikipedia articles, $C(i)$ is the total number of references in article $i$, $C_s(i)$ is the number of references using source $s$ (e.g., domain in URL) in article $i$, and $E(i)$ is the total number of authors of article $i$.
$$AL(s) = \sum_{i=1}^{n} \frac{E(i)}{T(i)} \cdot C_s(i), \tag{7}$$
where $s$ is the source, $n$ is the number of considered Wikipedia articles, $T(i)$ is the length of the source code (wiki text) of article $i$, $C_s(i)$ is the number of references using source $s$ (e.g., domain in URL) in article $i$, and $E(i)$ is the total number of authors of article $i$.
It is important to note that for pageview measures connected with sources extracted at the end of the assessed period, we use data for the whole period (month). For example, if references were extracted based on dumps as of 1 March 2020, then we considered pageviews of the articles for the whole of February 2020.
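Equations (1)–(7) share the same structure: each sums a per-article weight multiplied by $C_s(i)$. A minimal Python sketch of that accumulation follows. The `Article` record, the function name `score_sources` and the sample numbers are assumptions made for illustration; they are not taken from the authors' code or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical per-article record holding the quantities used in Eqs. (1)-(7)."""
    views: float          # V(i): cumulative pageviews in the assessed period
    authors: int          # E(i): total number of authors
    length: int           # T(i): length of the wiki text
    total_refs: int       # C(i): total number of references
    refs_by_source: dict  # source -> C_s(i)


def score_sources(articles):
    """Accumulate the F, P, PR, PL, A, AR and AL scores for every source."""
    scores = defaultdict(lambda: defaultdict(float))
    for a in articles:
        if a.total_refs == 0 or a.length == 0:
            continue  # avoid division by zero; such articles carry no usable signal here
        for source, c_s in a.refs_by_source.items():
            row = scores[source]
            row["F"] += c_s                               # Eq. (1)
            row["P"] += c_s * a.views                     # Eq. (2)
            row["PR"] += a.views / a.total_refs * c_s     # Eq. (3)
            row["PL"] += a.views / a.length * c_s         # Eq. (4)
            row["A"] += c_s * a.authors                   # Eq. (5)
            row["AR"] += a.authors / a.total_refs * c_s   # Eq. (6)
            row["AL"] += a.authors / a.length * c_s       # Eq. (7)
    return scores


# Invented sample data: two articles citing three placeholder domains.
articles = [
    Article(views=1000, authors=10, length=5000, total_refs=4,
            refs_by_source={"example.org": 3, "example.com": 1}),
    Article(views=200, authors=5, length=1000, total_refs=2,
            refs_by_source={"example.org": 1, "example.net": 1}),
]
scores = score_sources(articles)
print(dict(scores["example.org"]))
```

Passing the daily pageviews median in `views` instead of the cumulative value would give the Pm, PmR and PmL variants from the same code.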

4. Extraction of Wikipedia References

The Wikimedia Foundation backs up each language version of Wikipedia at least once a month and stores the backups on a dedicated server as “Database backup dumps”. Each file contains different data related to Wikipedia articles. Some of them contain the source code of Wikipedia pages in wiki markup; others describe individual elements of articles: headers, category links, images, external or internal links, page information and others. There are even files that contain the whole edit history of each Wikipedia page.
The variety of dump files makes it possible to extract the necessary data in different ways. Some files make it possible to get results in a relatively short time using a simple parser. However, other important information may be missing from such files. Therefore, in this section we describe two methods of extracting data about references in Wikipedia.

4.1. Basic Extraction

References often have links to different external sources (websites). For each language version of Wikipedia we used the dump file with external URL link records in order to extract the URLs from rendered versions of Wikipedia articles. For instance, for English Wikipedia we used the dump file from March 2020, “enwiki-20200301-externallinks.sql.gz”. This file contains data about external links placed in all pages of the selected language version of Wikipedia. Therefore, we took into account only links placed in the article namespace (ns0). We extracted over 280 million external links from the 55 considered language versions of Wikipedia. Table 1 shows the extraction statistics based on dumps from March 2020: total number of articles, number of articles with a certain number of external links (URLs), and total and unique number of external links in different language versions of Wikipedia.
Analysis of the external links showed that Swedish Wikipedia has the largest share of articles with at least one link: 96%. English Wikipedia has a slightly lower value of this indicator: about 91% of its articles have at least 1 external link. However, English Wikipedia has the largest share of articles with at least 100 external links: 1% of all articles in this language. Catalan (12.7), English (11.5) and Russian (10.1) Wikipedia have the largest total number of external links per article.
Based on the extraction of external links, we can find which of the domains (or subdomains) are often used in Wikipedia articles. Figure 1 shows the most popular domains (and subdomains) in over 280 million external links from 55 language versions of Wikipedia.
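Counting domains in a list of external links can be sketched as follows. This is only an illustration: the sample URLs are made up, and stripping a leading `www.` is an assumption about how domains might be normalized, not a rule stated in the text.

```python
from collections import Counter
from urllib.parse import urlsplit


def source_domain(url):
    """Return the lowercased host of a URL, without a leading 'www.' (assumption)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_domains(urls):
    """Count how many links point to each domain or subdomain."""
    counts = Counter()
    for url in urls:
        domain = source_domain(url)
        if domain:  # skip strings without a recognizable host
            counts[domain] += 1
    return counts


# Made-up external links, including a protocol-relative one.
links = [
    "https://www.imdb.com/title/tt0000001/",
    "http://imdb.com/name/nm0000001/",
    "https://web.archive.org/web/2020/http://example.org/",
    "//int.soccerway.com/teams/",
]
counts = count_domains(links)
print(counts.most_common(3))
```

Subdomains such as `int.soccerway.com` are kept whole here, matching the way domains and subdomains are listed in the text.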
It is important to note that although imdb.com (Internet Movie Database) is included in the list of sites that are generally unacceptable in English Wikipedia [2], this resource is in 2nd place in the list of the most commonly used websites in Wikipedia articles. The top 10 of the most commonly used websites also contains: web.archive.org (Wayback Machine), viaf.org (Virtual International Authority File), int.soccerway.com (Soccerway, a website on football), tvbythenumbers.zap2it.com (TV by the Numbers), animaldiversity.org (Animal Diversity Web), deadline.com (Deadline Hollywood), variety.com (Variety, an American weekly entertainment magazine), webcitation.org (WebCite, an on-demand archiving service), officialcharts.com (The Official UK Charts Company).
The obtained results can be used for further analysis. However, the basic extraction method, despite its relative simplicity, has some disadvantages. For example, we can extract all external links from an article using the basic extraction method, but we will miss information about the placement of each link in the article (e.g., whether it was placed in a reference). Another problem is excluding irrelevant links, such as archived copies of the source (when the original copy is presented and available), links generated automatically if the source has special identifiers or templates, links to other pages of Wikimedia projects (often they show additional information about the article but not the source of information) and others. Therefore, we decided to conduct a more complex extraction based on the source code of each Wikipedia article. This method is described in the next subsection.

4.2. Complex Extraction

Using Wikipedia dumps from March 2020, we extracted all references from over 40 million articles in 55 language editions that have at least 100,000 articles and an article depth index of at least 5 in recent years, as proposed in [26]. Complex extraction was based on the source code of the articles. Therefore, we used a different dump file than in the basic extraction; for example, the dump file as of March 2020 that we used for English Wikipedia is “enwiki-20200301-pages-articles.xml.bz2”.
In wiki code, references are usually placed between the special tags <ref>…</ref>. Each reference can be named by adding the “name” parameter to this tag: <ref name="...">...</ref>. Once such a reference has been defined in the article, it can be placed elsewhere in this article using only <ref name="..." />. This is how the same reference can be used several times with default wiki markup. However, there are other ways to do so. Depending on the language version of Wikipedia, we can also use special templates with specific names and sets of parameters. Some of them do not even have to be placed inside <ref>...</ref> tags.
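The tag-based part of this extraction can be sketched with regular expressions. This is a simplified illustration, not the authors' algorithm: it handles only `<ref>` tags with straight quotes, ignores references added through templates, and returns reused references after the defined ones rather than in document order. The function and pattern names are hypothetical.

```python
import re

# Opening tag that is not self-closing, followed by the body and a closing tag.
REF_FULL = re.compile(
    r"<ref\b(?P<attrs>[^>]*?)(?<!/)>(?P<body>.*?)</ref\s*>",
    re.IGNORECASE | re.DOTALL,
)
# Self-closing tag that reuses a named reference.
REF_REUSE = re.compile(r"<ref\b(?P<attrs>[^>]*?)/>", re.IGNORECASE)
NAME_ATTR = re.compile(
    r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""", re.IGNORECASE
)


def _ref_name(attrs):
    """Return the value of the name attribute, or None if it is absent."""
    m = NAME_ATTR.search(attrs)
    if not m:
        return None
    return next(g for g in m.groups() if g is not None)


def extract_refs(wikitext):
    """Return reference bodies, counting each reuse of a named reference."""
    named, refs = {}, []
    for m in REF_FULL.finditer(wikitext):
        body = m.group("body").strip()
        refs.append(body)
        name = _ref_name(m.group("attrs"))
        if name:
            named[name] = body
    for m in REF_REUSE.finditer(wikitext):
        name = _ref_name(m.group("attrs"))
        if name in named:
            refs.append(named[name])
    return refs


# Made-up wiki text with one named, one anonymous and one reused reference.
text = (
    'A.<ref name="smith">{{cite web|url=https://example.org/a|title=A}}</ref> '
    "B.<ref>[https://example.com/b B]</ref> "
    'C.<ref name="smith" />'
)
refs = extract_refs(text)
print(len(refs))
```

Counting the reuse as a separate appearance follows the convention of the F model, where a source cited 3 times has a frequency of 3.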
In general, we can divide references into two groups: with a special template and without one. References without a special template usually have the URL of the source and some optional description (e.g., title). References with special templates can have different data describing the source. Here, in separate fields, one can add information about author(s), title, URL, format, access date, publisher and others. The set of possible parameters with predefined names depends on the language version and the type of template, which can describe a book, journal, web source, news, conference and others. Figure 2 shows the most commonly used templates in <ref> tags in English Wikipedia. Among the most commonly used templates in this language version are: “Cite web”, “Cite news”, “Cite book”, “Cite journal”, National Heritage List for England (“NHLE”), “Citation”, “Webarchive”, “ISBN”, “In lang”, “Dead link”, Harvard citation no brackets (“Harvnb”), “Cite magazine”. In order to extract information about sources, we created our own algorithms that take into account the different names of reference templates and parameters in each language version of Wikipedia. The most commonly used parameters in this language version are: title, url, accessdate, date, publisher, last, first, work, website and access-date.
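Reading the named parameters of a citation template can be sketched as below. This is a deliberately naive illustration: it splits on `|`, so it does not handle nested templates or wiki links that contain pipes, and a dedicated wikitext parser would be more robust. The function name and the alias table are assumptions; only the pairing of `accessdate` with `access-date` is suggested by the parameter list above.

```python
# Hypothetical alias table: both spellings appear among the common parameters.
PARAM_ALIASES = {"accessdate": "access-date"}


def parse_template(text):
    """Split '{{Name |key=value |...}}' into (name, params); None if not a template."""
    inner = text.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        return None
    parts = inner[2:-2].split("|")
    name = parts[0].strip()
    params = {}
    for part in parts[1:]:
        # Split on the first '=' only, so URLs with query strings stay intact.
        key, sep, value = part.partition("=")
        if sep:
            key = key.strip().lower()
            params[PARAM_ALIASES.get(key, key)] = value.strip()
    return name, params


# Made-up reference body using the 'Cite web' template.
ref = ("{{Cite web |url=https://example.org/page?id=1 |title=Example "
       "|publisher=Example Press |accessdate=1 March 2020}}")
name, params = parse_template(ref)
print(name, params["url"])
```

A per-language table of template and parameter names would be needed to apply the same idea across language versions, since those names differ between them.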
It is important to note that the presence of some references cannot be identified directly from the source (wiki) code of the articles. Sometimes infoboxes or other templates in a Wikipedia article add references to the rendered version of the article. Figure 3 shows such a situation using the example of a table with references in the Wikipedia article “2019–2020 coronavirus pandemic”, which was added using the template “2019–2020 coronavirus pandemic data”. In our approach we include such references in the analysis when such templates appear in Wikipedia articles.
Some of the most popular templates allow adding identifiers of the source, such as DOI, JSTOR, PMC, PMID, arXiv, ISBN, ISSN, OCLC and others. Identifiers such as DOI, ISBN and ISSN can also be given in references through separate templates. For example, a DOI value can be written as “doi|...”. Moreover, some templates allow several identifiers to be inserted for one reference: the templates for ISBN and ISSN identifiers accept two or more values, so we can put “ISBN|...|...” or “ISSN|...|...|...” in the code. Table 2 shows the extraction statistics of the references with DOI, ISBN, ISSN, PMID, PMC identifiers. Table 3 shows the extraction statistics of the references with arXiv, Bibcode, JSTOR, LCCN, OCLC identifiers.
Special identifiers can determine the similarity between references even when they have different parameters in their descriptions (e.g., titles in other languages). Unification of these references can be done based on identifiers. For example, if a reference has the DOI number “10.3390/computers8030060”, we give it the URL “https://doi.org/10.3390/computers8030060”. More detailed information about the identifiers which we used to unify the references is shown in Table 4.
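The DOI case of this unification can be sketched as follows. Only the DOI rule is taken from the text; the function name, the dictionary layout and the fallback to the `url` parameter are assumptions, and the other identifiers from Table 4 are not covered.

```python
def canonical_url(ref_params):
    """Return a unified URL for a reference: DOI-based if a DOI is present."""
    doi = ref_params.get("doi", "").strip()
    if doi:
        return "https://doi.org/" + doi
    # Assumed fallback: keep the reference's own URL when no DOI is given.
    return ref_params.get("url", "").strip() or None


# Two made-up descriptions of the same work in different language versions.
ref_en = {"title": "An English title", "doi": "10.3390/computers8030060",
          "url": "https://example.org/en/article"}
ref_pl = {"title": "Polski tytuł", "doi": "10.3390/computers8030060"}

print(canonical_url(ref_en))
print(canonical_url(ref_en) == canonical_url(ref_pl))
```

Because both descriptions resolve to the same URL, they would be counted as one unique reference despite their different titles.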
One of the advantages of the complex extraction method (compared to the basic one described in the previous subsection) is the ability to distinguish between types of source URLs: the actual link to the page and an archived copy. For linking to web archiving services such as the Wayback Machine, WebCite and others, the special template “Webarchive” can be used. In most cases the template needs only two arguments: the archive URL and the date. This template is used in different languages and sometimes has different names. Additionally, within a single language this template can be called using other names, which are redirects to the original one. For example, in English Wikipedia alternative names of this template can be used: “Weybackdate”, “IAWM”, “Webcitation”, “Wayback”, “Archive url”, “Web archive” and others. Using information from those templates we found the most frequent domains of web archiving services in references.
It is important to note that, depending on the language version of Wikipedia, a template for archived URL addresses can have its own set of parameters and its own way of generating the final URL of the link to the source. For example, in English Wikipedia the template Webarchive has the parameter url, which must contain the full URL from the web archiving service. At the same time, the related template Webarchiv in German Wikipedia also has other ways to define a link to an archived source: one can provide the URL of the original source page (that was created before it was archived) using the url parameter and (or) additionally use parameters depending on the archive service: “wayback”, “archive-is”, “webciteID” and others. In this case, to extract the full URL of the archived web page, we need to know how the inserted value of each parameter affects the final link shown to the reader of the Wikipedia article in each language version.
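Separating archived copies from original links can also be approximated at the URL level. The sketch below is an assumption-laden illustration, not the template-based method described above: the host list covers only the two archiving domains named in this article, and the Wayback Machine pattern relies on that service's usual `/web/<timestamp>/<original URL>` layout.

```python
import re
from urllib.parse import urlsplit

# Archiving hosts mentioned in the text; a fuller list would be needed in practice.
ARCHIVE_HOSTS = {"web.archive.org", "webcitation.org", "www.webcitation.org"}

WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/(\d{1,14})[a-z_]*/(.+)$", re.IGNORECASE
)


def is_archived_copy(url):
    """Return True if the URL points to one of the known archiving hosts."""
    host = urlsplit(url).hostname or ""
    return host in ARCHIVE_HOSTS


def wayback_original(url):
    """Return the original URL embedded in a Wayback Machine link, else None."""
    m = WAYBACK.match(url)
    return m.group(2) if m else None


# Made-up archived link to a placeholder page.
archived = "https://web.archive.org/web/20200301000000/http://example.org/page"
print(is_archived_copy(archived), wayback_original(archived))
```

Recovering the original URL in this way lets an archived copy be attributed to the source domain rather than to the archiving service.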
In the extraction we also took into account short citations from the “Harvard citation” family of templates, which use parenthetical referencing. These templates are generally used as in-line citations that link to the full citation (with the full metadata of the source). This enables a specific reference to be cited multiple times with some additional specification (such as a page number) and other details (comments). We included the following templates in the analysis: “Harv” (Harvard citation), “Harvnb” (Harvard citation no brackets), “Harvtxt” (Harvard citation text), “Harvcol”, “Harvcolnb”, “Sfn” (Shortened footnote template) and others. Depending on the language version of Wikipedia, each template can have a different corresponding name and additional synonymous names. For example, in English Wikipedia, “Harvard citation”, “Harv” and “Harvsp” mean the same template (with the same rules), while the corresponding template in French has such names as “Référence Harvard”, “Harvard” and also “Harv”.
Taking into account unification of URLs based on special identifiers, excluding URLs of archived copies of the sources and including special templates outside <ref> tags, we counted the number of all and unique references in each considered language version. Table 5 presents total number of articles, number of articles with at least 1 reference, at least 10 references, at least 100 references and number of total and unique number of references in each considered language version of Wikipedia.
Analysis of the numbers of the references extracted by complex extraction showed other statistics comparing to basic extraction of the external links described in Section 4.1. The largest share of the article with at least one references has Vietnamese Wikipedia-84.8%. Swedish, Arabic, English and Serbian Wikipedia has 83.5%, 79.2%, 78.2% and 78.1% share of such articles, respectively. If we consider only articles with at least 100 references, then the largest share of such articles will have Spanish Wikipedia-3.5%. English, Swedish and Japanese Wikipedia has 1.1%, 0.9% and 0.8% share of such articles, respectively. However, the largest total number of the references per number of articles has English Wikipedia—9.6 references. Relatively large number of references per article has also Spanish (9.2) and Japanese (7.1) Wikipedia.
The largest number of the references with DOI identifier has English Wikipedia (over 2 million) at the same time has the largest number of average number of references with DOI per article—34.3%. However, the largest share of the references with DOI among all references has Galician (8.4%) and Ukrainian (6.6%) Wikipedia.
The largest number of the references with ISBN identifier has English Wikipedia (over 3.5 million) at the same time has the largest number of average number of references with ISBN per article-34.3%. However, the largest share of the references with ISBN among all references has Kazakh (20.3%) and Belarusian (13.1%) Wikipedia.
Based on the URLs extracted from the obtained references, we can find which domains (or subdomains) are often used in Wikipedia articles. Figure 4 shows the most popular domains (and subdomains) in over 200 million references of Wikipedia articles in 55 language versions. Compared with the basic extraction (see Section 4.1), there are some changes in the top 10 most commonly used sources in references: deadline.com (Deadline Hollywood), tvbythenumbers.zap2it.com (TV by the Numbers), variety.com (Variety, an American weekly entertainment magazine), imdb.com (Internet Movie Database), newspapers.com (historic newspaper archive), int.soccerway.com (Soccerway, a website on football), web.archive.org (Wayback Machine), oricon.co.jp (Oricon Charts), officialcharts.com (The Official UK Charts Company), gamespot.com (GameSpot, a video game website).
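Counting the domains (and subdomains) of reference URLs can be sketched with the Python standard library. The snippet below is a minimal illustration rather than the code used in the study. Stripping only the leading "www." so that subdomains such as int.soccerway.com stay separate is our assumption, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def source_domain(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def top_domains(urls, n=10):
    """Return the n most frequent domains with their reference counts."""
    counts = Counter(d for d in map(source_domain, urls) if d)
    return counts.most_common(n)


urls = [
    "https://www.variety.com/a",
    "http://variety.com/b",
    "https://int.soccerway.com/x",
    "not a url",
]
print(top_domains(urls))
```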

5. Assessment of Sources

To assess the references using the proposed models, apart from extracting the sources we also extracted data related to page views, the length of the articles and the number of authors. We used different dump files that are available on “Wikimedia Downloads” [35].
Based on the complex extraction method we measured the popularity and reliability of the sources in references. Due to the size limitations of this paper, we mostly use the F or PR model to show the various rankings of sources. The exceptions are cases where we compare all 10 proposed models for popularity and reliability assessment of the sources in Wikipedia. Additionally, in the tables we limit the languages to some of the most developed ones: Arabic (ar), German (de), English (en), Spanish (es), Persian (fa), French (fr), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swedish (sv), Vietnamese (vi), Chinese (zh). The extended version of the results is available on the web page: http://data.lewoniewski.info/sources/. For example, figures showing the most popular and reliable sources for each considered language version of Wikipedia using the F-model (http://data.lewoniewski.info/sources/modelf) and the PR-model (http://data.lewoniewski.info/sources/modelpr) are placed there.
Table A1 shows the positions in the local rankings of the most popular and reliable sources in some of the most developed language versions of Wikipedia in February 2020 using the PR model. This table makes it possible to compare the rank of a source that has a leading position in at least one language version with its rank in other languages. For example, “taz.de” (Die Tageszeitung) is in 3rd place in German Wikipedia in February 2020, while in the same period it is in 692nd, 785th and 996th place in French, Persian and Polish Wikipedia, respectively. In Persian Wikipedia the most reliable source in February 2020 was “irna.ir” (Islamic Republic News Agency), while in English Wikipedia it is in 8072nd place; this source is not mentioned at all in Polish and Swedish Wikipedia. As another example, “lenta.ru” was the most reliable source in Russian Wikipedia in February 2020, while it is in 166th, 310th, 325th and 352nd place in Polish, Vietnamese, German and Arabic Wikipedia, respectively. There are also sources that have a relatively high position in all language versions: “variety.com” and “deadline.com” are always in the top 20, “imdb.com” is in the top 20 in almost all languages (except Japanese), and “who.int” is in the top 100 of reliable sources in each considered language.

6. Similarity of Models

According to the results presented in the previous section, each source can occupy a different position in the ranking of the most reliable sources depending on the model. It is worthwhile to check how similar the results obtained by different models are. For this purpose we used Spearman’s rank correlation, which quantifies on a scale from −1 to 1 the degree to which two rankings are associated. Initially we took only sources that appeared in the top 100 of at least one of the rankings of the most popular and reliable sources in multilingual Wikipedia in February 2020. Altogether, we obtained 180 sources and their positions in each of the rankings. Table 6 shows Spearman’s correlation coefficients between these rankings.
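For readers who want to reproduce this kind of comparison, the sketch below computes Spearman’s rank correlation between two rankings in pure Python. It is an illustration under our own assumptions: each ranking is a dictionary mapping a source to its position, only sources present in both rankings are compared, and tied positions receive the mean of their ranks. The example positions are invented.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(ranking_a, ranking_b):
    """Spearman's rho for two dicts mapping source -> position in a ranking."""
    shared = sorted(set(ranking_a) & set(ranking_b))
    x = average_ranks([ranking_a[s] for s in shared])
    y = average_ranks([ranking_b[s] for s in shared])
    n = len(shared)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Pearson correlation of the two rank vectors.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Invented positions of five sources in two hypothetical rankings.
model_1 = {"s1": 1, "s2": 2, "s3": 3, "s4": 4, "s5": 5}
model_2 = {"s1": 2, "s2": 1, "s3": 4, "s4": 3, "s5": 5}
print(round(spearman(model_1, model_2), 2))
```

Computing the coefficient as a Pearson correlation of ranks, rather than with the shortcut formula based on squared rank differences, keeps the result correct when positions are tied.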
We can observe that the highest correlation (0.99) is between the rankings based on the P and Pm models. This can be explained by the similarity of the measures in these models: the first is based on cumulative page views and the latter on the median of daily page views in a given month.
Another pair of similar rankings comes from the PL and PR models (0.98). Both measures use total page views data. In the PL model the value of this measure is divided by article length, in the PR model by the number of references. As we mentioned in Section 2 and Section 3, the number of references and article length are very important in quality assessment of Wikipedia articles and are also correlated: we can expect that longer articles have a larger number of references.
Given the previously described similarity between P and Pm, we can also explain the similarity between the PL and PmR models, with a Spearman’s correlation coefficient of 0.97.
The lowest similarity (0.37) is between the F and P models. It results from the different nature of these measures. In Wikipedia anyone can create and edit content. However, not every change in Wikipedia articles can be checked by a specialist in the field, for example by checking the reliability of the sources inserted in the references. Although some sources are used frequently, there is a chance that they have not been verified yet and have not been replaced by more reliable sources. The next pair of rankings with a low correlation is Pm and F. This low correlation follows from the similarity of the page view measures (P and Pm).
It is also important to note the low similarity (0.41) between rankings based on the AR and P models. Such differences can be connected with the measures used in these models: the AR model uses the number of authors over the whole edit history of an article divided by the number of references, whereas P uses page view data for the selected month.
In the second iteration we extended the number of sources to top 10,000 in each ranking of the most popular and reliable sources in multilingual Wikipedia in February 2020. We obtained 19,029 sources. Table 7 shows Spearman’s correlation coefficients between these extended rankings.
In the case of the extended rankings (top 10,000) there are no significant changes in the Spearman’s correlation coefficient values compared to the top 100 rankings in Table 6. However, it should be noted that the largest difference in coefficient values, 0.26, appears between the PR and A models (0.82 in the top 100 and 0.56 in the top 10,000).
The heatmap in Figure 5 shows Spearman’s correlation coefficients between rankings of the top 100 most reliable sources in each language version of Wikipedia in February 2020 obtained by F-model in comparison with other models.
Comparing the Spearman’s correlation coefficients within each considered language version of Wikipedia, we find that the largest average correlation between the F-model and the other models is observed for Japanese (ja) and English (en) Wikipedia: 0.61 and 0.59, respectively. Catalan (ca) and Latin (la) Wikipedia have the smallest average values of the correlation coefficients: 0.16 and 0.19, respectively. Considering the coefficient values across all languages for each pair of the F-model and another model, the F/AL pair has the largest average value (0.71) and the F/PmR pair the smallest (0.18).

7. Classification of Sources

7.1. Metadata from References

Based on citation templates in Wikipedia we are able to find more information about the source: authors, publication date, publisher and others. Using such metadata we decided to find which publishers and journals are the most popular and reliable.
We first analyzed the values of the publisher parameter in citation templates of the references in articles of English Wikipedia (as of March 2020). We found over 18 million references with citation templates that have a value in the publisher parameter. Figure 6 shows the most commonly used publishers based on this analysis.
The following names are found most often in the publisher parameter of references: United States Census Bureau, Oxford University Press, BBC, BBC Sport, Cambridge University Press, Routledge, National Park Service, AllMusic, Yale University Press, BBC News, Prometheus Global Media, United States Geological Survey, ESPN, CricketArchive, International Skating Union, Official Charts Company.
Using different popularity and reliability models we assessed all publishers based on the related parameter in citation templates placed in references of English Wikipedia. Table 8 shows the most popular and reliable publishers with their positions in the ranking depending on the model.
Comparing the ranking positions of the publishers across different models, we observed that some of the sources always have a leading position: Oxford University Press (1st or 2nd place depending on the model), BBC (2nd-5th place), Cambridge University Press (2nd-5th place), Routledge (3rd-6th place), BBC News (5th-10th place).
Some publishers have a high position in only a few models. For example, “United States Census Bureau” takes 1st place in the F model (frequency) and the AR model (authors per references count). At the same time, in the P model (pageviews) and the PL model (pageviews per length of the text) this source took 27th and 11th place, respectively. Another of the most frequent publishers in Wikipedia, “National Park Service”, took 7th place. However, it took only 94th and 58th place in the P (pageviews) and PmL (pageviews median per length of the text) models, respectively. The publisher “Springer” took 5th place in the PmR model (pageviews median per references count), but only 19th place in the F model (frequency). CNN took 2nd place in the P (pageviews) and Pm (pageviews median) models, but at the same time took 22nd and 16th place in the F (frequency) and AR (authors per references count) models, respectively. Wikimedia Foundation as a source is in the top 10 in the P (pageviews) model, but at the same time is far from a leading position in the F (frequency) and AR (authors per references count) models: 5541st and 3008th place, respectively.
It is important to note that this ranking of publishers only takes into account references with a filled publisher parameter in citation templates in English Wikipedia. Therefore, it cannot show complete information about leading sources in different languages (especially in those languages where citation templates are rarely used).
Next we extracted the values of the journal parameter in citation templates of the references from articles in English Wikipedia. We found over 3 million references with citation templates that have a value in the journal parameter. Figure 7 shows the most commonly used journals based on this analysis. The most commonly used journals were: Nature, Astronomy and Astrophysics, Science, The Astrophysical Journal, Lloyd’s List, PLOS ONE, Monthly Notices of The Royal Astronomical Society, The Astronomical Journal, Billboard.
Using different popularity and reliability models we assessed all journals based on the related parameter in citation templates placed in references of English Wikipedia. Table 9 shows the most popular and reliable journals with their positions in the ranking depending on the model. It is important to note that the same journal appears under two different names, “Astronomy and Astrophysics” and “Astronomy & Astrophysics”, because it was written in both ways in citation templates.
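Such name variants could be merged before ranking. The following sketch shows one possible normalization and is not part of the method used in this study, in which the variants were kept separate: names are lower-cased, “&” is replaced by “and” and whitespace is collapsed. The function names and the counts in the example are hypothetical.

```python
import re
from collections import Counter


def normalise_journal(name):
    """Lower-case a journal name, spell out '&' and collapse whitespace."""
    name = name.casefold().replace("&", " and ")
    return re.sub(r"\s+", " ", name).strip()


def merge_variants(counts):
    """Sum reference counts of journal names that normalise to the same key."""
    merged = Counter()
    for name, n in counts.items():
        merged[normalise_journal(name)] += n
    return merged


# Invented counts, for illustration only.
example = {"Astronomy and Astrophysics": 3, "Astronomy & Astrophysics": 2, "Nature": 5}
print(merge_variants(example))
```

A rule-based normalization of this kind would not merge longer variants such as a full and an abbreviated official title, which would require a curated mapping.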
Comparing the ranking positions of the journals across different models, we can also observe that some of the sources always have a leading position: Nature (1st in all models), Science (2nd-3rd place depending on the model), PLOS ONE (3rd-6th place), The Astrophysical Journal (4th-7th place).
Some journals have a high position in only a few models. For example, the journal “Lancet” took 3rd place in the P (pageviews) and Pm (pageviews median) models, but is only in 23rd place in the F (frequency) model. As another example, “Proceedings of the National Academy of Sciences of the United States of America” has 4th place in the PmR model (pageviews median per references count) and at the same time 13th place in the F (frequency) model. “Proceedings of The National Academy of Sciences” took 8th place in the PmR model (pageviews median per references count), but has 18th position in the F model (frequency). There are journals whose position differs significantly depending on the model. A good example is “MIT Technology Review”, which took 5th place in the P model (pageviews), but only 5565th and 3900th place in the F (frequency) and AR (authors count per references count) models, respectively.
Although the obtained results allow us to compare different metadata related to the source, we need to take into account a significant limitation of this method: we can only assess the sources in references that use citation templates. Additionally, as we already discussed in Section 4.2, the related parameters of the references are not always filled in by Wikipedians. Therefore, we decided to take into account all references with a URL address and conducted a more complex analysis of the source types based on semantic databases.

7.2. Semantic Databases

Based on the URL it is possible to identify the title and other information related to the source. Using Wikidata [37,38] and DBpedia [39,40] we found over 900 thousand items (including broadcasters, periodicals, web portals, publishers and others) which have separate domain(s) or subdomain(s) assigned as their official site. Table 10 shows the positions in the global ranking of the most popular and reliable sources with an identified title, based on the items found, in the 55 considered language versions of Wikipedia in February 2020 using different models.
Leading positions in various models are occupied by the following sources: Deadline Hollywood, TV by the Numbers, Variety and Internet Movie Database. “Forbes”, “The Washington Post”, “CNN”, “Entertainment Weekly” and “Oricon” are in the top 20 of all rankings in Table 10. We can also observe sources with relatively large differences in rankings between the models. For example, “Newspapers” (historic newspaper archive) is in 5th place among the most frequently used sources in Wikipedia, while it is in 33rd and 23rd place in the Pm (pageviews median) and PmL (pageviews median per length of the text) models, respectively. As another example, “Soccerway” is in 7th place in the ranking of the most commonly used sources (based on the F model), but is in 116th and 100th place in the P and Pm models, respectively. Although “American Museum of Natural History” is in the top 20 of the most commonly used sources in Wikipedia (based on the F model), it falls outside the top 5000 in the P (pageviews), Pm (pageviews median), PmR (pageviews median per reference count) and PmL (pageviews median per length of text) models.
Table 11 shows the most popular and reliable types of sources in selected language versions of Wikipedia in February 2020 based on the PR model. In almost all language versions websites are the most reliable sources. Magazines and business-related sources are in the top 10 of the most reliable types of sources in all languages. Film databases are among the most reliable sources in Arabic, French, Italian, Polish and Portuguese Wikipedia. In other languages such sources are placed above 19th place. Arabic, English, French, Italian and Chinese Wikipedia prefer newspapers as a reliable source more than other languages, which place such sources lower in the ranking (but above 14th place). News agencies are more reliable for Persian Wikipedia than for other languages. Government agencies as a source have much higher reliability in Persian and Swedish Wikipedia than in other languages. Holding companies provide more reliable information for the Japanese and Chinese versions. In Dutch and Polish Wikipedia archive websites have a relatively higher position in the reliability ranking. Periodical sources are more reliable in German, Spanish and Polish Wikipedia. Review aggregators are more reliable in Arabic and Polish Wikipedia than in the other considered languages. Television networks are in 7th place in German Wikipedia and in 14th place in Portuguese Wikipedia, while in other languages such sources are ranked lower than 20th place (down to 125th place). Social networking services are placed in the top 20 of the most reliable types of sources in Japanese, Polish and Chinese Wikipedia. Weekly magazines are in the top 10 in English, Italian, Portuguese and Russian Wikipedia.
Based on the knowledge about the type of each source we decided to limit the ranking to a specific area. We chose only periodical sources, that is, sources aligned to one of the following types: online newspaper (Q1153191), magazine (Q41298), daily newspaper (Q1110794), newspaper (Q11032), periodical (Q1002697), weekly magazine (Q12340140). The top of the ranking of the most reliable periodical sources in all considered language versions of Wikipedia in February 2020 is occupied by: Variety, Entertainment Weekly, The Washington Post, USA Today, People, The Indian Express, The Daily Telegraph, Time, Pitchfork, Rolling Stone.
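Restricting a ranking to periodical sources can be sketched as a simple filter over type identifiers. The snippet is an illustration only: the Wikidata identifiers are those listed above, while the data structures (a dictionary of scores and a dictionary of type sets per source), the function name and the sample values are our assumptions.

```python
# Wikidata types treated as periodical sources (identifiers from the text).
PERIODICAL_TYPES = {
    "Q1153191",   # online newspaper
    "Q41298",     # magazine
    "Q1110794",   # daily newspaper
    "Q11032",     # newspaper
    "Q1002697",   # periodical
    "Q12340140",  # weekly magazine
}


def periodical_ranking(scores, source_types):
    """Keep sources with at least one periodical type, best score first."""
    kept = [
        (source, score)
        for source, score in scores.items()
        if source_types.get(source, set()) & PERIODICAL_TYPES
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Invented scores and types; "Q_OTHER" is a placeholder, not a real identifier.
scores = {"a.example": 3.0, "b.example": 9.0, "c.example": 5.0}
types = {
    "a.example": {"Q41298"},
    "b.example": {"Q_OTHER"},
    "c.example": {"Q11032", "Q_OTHER"},
}
print(periodical_ranking(scores, types))
```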
The most popular periodical sources in Wikipedia articles from 55 language versions according to different popularity and reliability models in February 2020 are shown in Table 12. There are sources that have stable reliability in all models: “Variety” always has 1st place, “Entertainment Weekly” 2nd-3rd place, “The Washington Post” occupies 2nd-4th place and “USA Today” took 4th-5th place depending on the model. Although “Lenta.ru” is the 6th most commonly used periodical source in different languages of Wikipedia (using the F model), it is placed in 21st and 19th place using the P and Pm models, respectively. “The Daily Telegraph” is in the top 10 most reliable periodical sources in all models. “People” is in 18th place in the frequency ranking, but at the same time took 4th place in the PmR model.
Given the local rankings of periodicals, we can consider the difference in reliability and popularity between language versions. Table A2 shows the positions in local rankings of periodical sources in different language versions of Wikipedia in February 2020 using the PR model. In almost all considered languages (except Dutch) “Variety” took 1st-4th place in the local rankings of the most reliable periodical sources. Some sources that are in leading positions in local rankings are not present at all as sources in some languages. For example, “Aliqtisadi” (Arabic news magazine) is in 2nd place in Arabic Wikipedia, but in English, Persian, Italian, Japanese and Russian Wikipedia its position is lower than 600th place, and it is not present as a source in the other languages. A similar tendency applies to “Ennahar newspaper”, which has 5th place in Arabic Wikipedia. In German Wikipedia the 2nd, 3rd and 4th places belong to “Die Tageszeitung”, “DWDL.de” and “Auto, Motor und Sport”. In Spanish Wikipedia the leading local periodical sources are: “20 minutos”, “El Confidencial”, “Entertainment Weekly”, “¡Hola!”. In Persian Wikipedia one of the most reliable periodical sources is “Donya-e-Eqtesad”, which is not present at all in most of the considered languages. The most reliable sources in French Wikipedia include: “Le Monde”, “Jeune Afrique”, “Le Figaro”, “Huffington Post France”. The Italian version of Wikipedia contains such highly reliable local sources as “la Repubblica”, “Il Post” and “Il Fatto Quotidiano”. In Japanese Wikipedia the leading reliable sources include “Nihon Keizai Shimbun”, “Tokyo Sports” and “Yomiuri Shimbun”. Dutch Wikipedia has “De Volkskrant”, “Algemeen Dagblad”, “Het Laatste Nieuws”, “Trouw” and “NRC Next” among its most reliable periodical sources. Polish Wikipedia has “Wprost” and “TV Guide” in the top 3 periodical sources. In Portuguese Wikipedia “Veja” and “Exame” are among the most reliable periodical sources.
“Lenta.ru” and “Komsomolskaya Pravda” are the leading periodical sources in Russian Wikipedia. The Swedish language version has “Sydsvenskan”, “Dagens Industri” and “Helsingborgs Dagblad” as leading reliable sources. “VnExpress” took 1st place among the most reliable periodical sources of Vietnamese Wikipedia. “Apple Daily” is the most reliable periodical source in the Chinese language version.

8. Temporal Analysis

Using complex extraction of the references, apart from the data from February 2020 we also used dumps from November 2019, December 2019 and January 2020. Based on these data we measured the popularity and reliability of the sources in different months.
Table 13 shows the positions in the rankings of popular and reliable sources with an identified title, depending on the period, in all considered language versions of Wikipedia using the PR model. The results show that some of the sources did not change their position in the ranking based on the PR model. This applies especially to sources with a leading position. For example, “Deadline Hollywood”, “Variety”, “Entertainment Weekly”, “Rotten Tomatoes” and “Oricon” each occupied the same place in the top 10 in every studied month. “Internet Movie Database” and “TV by the Numbers” exchanged 3rd and 4th places. This stability is due to the fact that, in the absolute values of the popularity and reliability measurement obtained using the PR model, most of these sources have a significant lead over their closest competitors.
Next we decided to limit the list of sources to periodical ones (as was done in Section 7.2). Table 14 shows the positions in the rankings of popular and reliable sources, depending on the period, in all considered language versions using the PR model. Similarly to the previous table, we observe no significant changes in position for the leading sources. In the four considered months the top 10 most reliable periodical sources always included: “Variety”, “Entertainment Weekly”, “The Washington Post”, “People”, “USA Today”, “The Indian Express”, “The Daily Telegraph”, “Pitchfork”, “Time”.
The results show that in the case of periodical sources the positions in the ranking are less stable between months than in the general ranking. For the reasons already explained, the 2 top sources (Variety and Entertainment Weekly) did not change their positions. Additionally, we can distinguish The Daily Telegraph, with a stable 7th place during the whole considered period. Nevertheless, in the top 10 of the most popular and reliable periodical sources of Wikipedia we can observe minor changes in positions. This applies in particular to People, Pitchfork, The Washington Post, USA Today, The Indian Express and Time. Those sources rose or fell by 1-2 positions in the top 10 ranking between November 2019 and February 2020.
As mentioned before, the minor changes in the ranking of sources during the considered period are mainly due to a large margin in the absolute values of the popularity and reliability measurement. This applies in particular to leading sources. However, there may be relatively new sources that have strong prerequisites to become leaders, or even outsiders, in the near future. The next section describes the method for measuring such changes and its results.

9. Growth Leaders

Wikipedia articles may have a long edit history. Information and sources in such articles can be changed many times. Moreover, the criteria for reliability assessment of sources can change over time in each language version of Wikipedia. Based on the assessment of the popularity and reliability of each source in Wikipedia in a certain period of time (month), we can compare the differences between the values of the measurement. This can help to find out how popularity and reliability changed (increased or decreased) in a particular month. For example, a certain Internet resource may have appeared only recently, and people have actively begun to use it as a source of information in Wikipedia articles. Another example: a well-known website that is often used in Wikipedia references may dramatically lose confidence (reputation) as a reliable source, so that editors actively start to replace this source with another or place an additional reference next to existing ones. First place in such a ranking means that for the selected source we observed the largest growth of the popularity and reliability score compared to the previous month.
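A growth leaders ranking of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: growth is taken as the absolute difference between the scores of two consecutive months, a source missing in one month is treated as having a score of zero, and the function name and sample values are hypothetical.

```python
def growth_leaders(previous, current, top=5):
    """Rank sources by the change of their score between two months.

    previous, current: dicts mapping source -> score for a given model.
    Returns up to `top` (source, growth) pairs, largest growth first.
    """
    sources = set(previous) | set(current)
    growth = {s: current.get(s, 0.0) - previous.get(s, 0.0) for s in sources}
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(growth.items(), key=lambda item: (-item[1], item[0]))[:top]


# Invented scores for two consecutive months.
january = {"a.example": 10, "b.example": 5}
february = {"a.example": 12, "b.example": 4, "c.example": 7}
print(growth_leaders(january, february))
```

Sorting the same differences in ascending order would give the sources with the largest decline.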
Table 15 shows which periodical sources had the largest growth of reliability in selected languages and periods of time based on the F model. For this table we chose only sources that were placed in the top 5 of the growth leaders ranking in at least one of the languages and selected months. The results show that there are no stable growth leaders among the sources when we compare different periods of time.
The F model shows how many references in Wikipedia articles contain specific sources. Therefore, we can analyze which sources were added more often to references in Wikipedia articles in the considered month. For example, in December 2019 “Die Tageszeitung” and “Handelsblatt” were the leading growing sources in German Wikipedia, “Jeune Afrique” and “Les Inrockuptibles” in French Wikipedia, and “Komsomolskaya Pravda” and “Lenta.ru” in Russian Wikipedia. In the next month (January 2020) “Süddeutsche Zeitung” and “Die Tageszeitung” were the leading growing sources in German Wikipedia, “Variety” and “La Montagne” in French Wikipedia, and “Variety” and “Komsomolskaya Pravda” in Russian Wikipedia. In the last considered month (February 2020) “Die Tageszeitung” and “Variety” were the leading growing sources in German Wikipedia, “Jeune Afrique” and “La Montagne” in French Wikipedia, and “Sport Express” and “Variety” in Russian Wikipedia.
Table 16 shows which sources had the largest growth of reliability in different languages and periods of time based on the PR model. For this table we also chose only sources that were placed in the top 5 of the growth leaders ranking in at least one of the languages and selected months. The results again show that there are no stable growth leaders among the sources when we compare different periods of time.
The PR model shows how many references in Wikipedia articles contain specific sources, taking into account the popularity of the articles. The results show that in December 2019 “Variety” and “Deutsche Jagd-Zeitung” were the leading growing reliable sources in German Wikipedia, “Variety” and “Entertainment Weekly” in French Wikipedia, and “Lenta.ru” and “Entertainment Weekly” in Russian Wikipedia. In the next month (January 2020) “Die Tageszeitung” and “DWDL.de” were the leading growing sources in German Wikipedia, “Les Inrockuptibles” and “Le Monde” in French Wikipedia, and “Variety” and “Lenta.ru” in Russian Wikipedia. In the last considered month (February 2020) “la Repubblica” and “Algemeen Dagblad” were the leading growing sources in German Wikipedia, “Atlanta” (magazine) and “Le Figaro étudiant” in French Wikipedia, and “New York Post” and “Novosti Kosmonavtiki” in Russian Wikipedia.

10. Discussion of the Results

This study describes different models for popularity and reliability assessment of the sources in different language versions of Wikipedia. In order to use these models it is necessary to extract information about the sources from references, as well as measures related to the quality and popularity of Wikipedia articles. We observed that, depending on the model, the positions of websites in the rankings of the most reliable sources can differ. In language versions that are mostly used on the territory of one country (for example Polish, Ukrainian, Belarusian), the highest positions in such rankings are often occupied by local (national) sources. Therefore, the community of editors in each language version of Wikipedia can have its own preferences when deciding whether to accept (or reject) a source in references as confirmation of a certain fact. So, the same source can be reliable in one language version of Wikipedia, while the community of editors of another language may not accept it in the references and may remove or replace this source in an article.
The simplest of the models proposed in this study is based on the frequency of occurrences, which is commonly used in related studies. The other 9 novel models use various combinations of measures related to the quality and popularity of Wikipedia articles. We provided an analysis of how the results differ depending on the model. For example, if we compare the frequency-based (F) rankings with the other (novel) ones in each language version of Wikipedia, the AL-model has the highest average similarity (rank correlation coefficient of 0.71) and the PmR-model the lowest (rank correlation coefficient of 0.18).
The analysis of sources was conducted in various ways. One of the approaches was to extract information from citation templates. Based on the related parameter in references of English Wikipedia we found the most popular publishers (such as United States Census Bureau, Oxford University Press, BBC, Cambridge University Press). The most commonly used journals in citation templates were: Nature, Astronomy and Astrophysics, Science, The Astrophysical Journal, Lloyd’s List, PLOS ONE, Monthly Notices of The Royal Astronomical Society, The Astronomical Journal, Billboard. However, this approach was limited and did not include references without citation templates. Therefore, we decided to use semantic databases to identify the sources and their types.
After obtaining data about the types of the sources we found that magazines and business-related sources are in the top 10 of the most reliable types of sources in all considered languages. However, the preferred type of source in references depends on the language version of Wikipedia. For example, film databases are among the most reliable sources in Arabic, French, Italian, Polish and Portuguese Wikipedia. In other languages such sources are placed above 19th place.
Including data from Wikidata and DBpedia allowed us to find the best sources in specific area. Using information about the source types and after choosing only periodical ones, we found that there are sources that have stable reliability in all models - “Variety” has always 1st place, “Entertainment Weekly” 2nd-3nd place, “The Washington Post” occupies 2nd-4th place, “USA Today” took 4th-5th place depending on the model. Despite the fact that “Lenta.ru” is the 6th most commonly used periodical source in different languages of Wikipedia (using F model), it is placed on 21st and 19th place using P and Pm models respectively. “The Daily Telegraph” is in the top 10 most reliable periodical sources in all models. “People” is on 18th place in the frequency ranking but at the same time took 4th place in PmR model.
Using complex extraction of the references in addition to data from February 2020 we also used dumps from November 2019, December 2019, and January 2020. Based on those data we measured popularity and reliability of the sources in different months. After limiting the sources to periodicals we found that in four considered months the top 10 most reliable periodical sources in multilingual Wikipedia always included: “Variety”, “Entertainment Weekly”, “The Washington Post”, ”People”, “USA Today”, “The Indian Express”, “The Daily Telegraph”, “Pitchfork”, and “Time”. Minor changes in the ranking of sources appearing during the considered period are mainly due to a large margin in absolute values of popularity and reliability measurement.
Different approaches assessing reliability of the sources presented in this research contribute to a better understanding which references are more suitable for specific statements that describe subjects in a given language. Unified assessment of the sources can help in finding data of the best quality for cross-language data fusion. Such tools as DBpedia FlexiFusion or GlobalFactSync Data Browser [41,42] collect information from Wikipedia articles in different languages and present statements in a unified form. However, due to independence of edition process in each language version, the same subjects can have similar statements with various values. For example, population of the city in one language can be several years old, while other language version of the article about the same city can update this value several times a year on a regular basis along with information about the source. Therefore, we plan to create methods for assessing sources of such conflict statements in Wikipedia, Wikidata and DBpedia to choose the best one. This can help to improve quality in cross-language data fusion approaches.
The proposed models can also help to assess the reliability of sources in Wikipedia on a regular basis. This can support understanding of the preferences of Wikipedia editors and readers in a particular month. Additionally, it can help to automatically detect sources with low reliability before a user inserts them into a Wikipedia article. Moreover, results obtained using the proposed models may be used to suggest sources with higher reliability scores to Wikipedians in a selected language version or on a selected topic.

10.1. Effectiveness of Models

In this section we present the assessment of the models’ effectiveness. The Python algorithms prepared for this study were tested on a desktop computer with an Intel Core i7-5820K CPU and an SSD drive. The algorithms used only one processor thread. Because each model uses its own set of measures, we divided the assessment into several stages, covering the extraction of:
  • External links using the basic extraction method on compressed gzip dumps with a total volume of 12 GB: 0.28 milliseconds per article on average.
  • Sources from references using the complex extraction method on bzip2 dumps with a total volume of 64 GB: 2 milliseconds per article on average.
  • Text length of articles (as a number of characters) using compressed bzip2 dumps with a total volume of 64 GB: 0.68 milliseconds per article on average.
  • Total page views for the considered month using compressed bzip2 dumps with a total volume of 12 GB: 0.25 milliseconds per article on average.
  • Median of daily page views for the considered month using compressed bzip2 dumps with a total volume of 12 GB: 0.26 milliseconds per article on average.
  • Number of authors of articles using compressed bzip2 dumps with a total volume of 170 GB: 1.12 milliseconds per article on average.
Given the above, and the fact that the measures for each model can be calculated during conversion, the time the algorithm needs to calculate the popularity and reliability of a source is as follows:
  • F model: 2 milliseconds per article.
  • P, PR model: 2.25 milliseconds per article.
  • Pm, PmR model: 2.26 milliseconds per article.
  • PL model: 2.93 milliseconds per article.
  • PmL model: 2.94 milliseconds per article.
  • A, AR model: 3.12 milliseconds per article.
  • AL model: 3.8 milliseconds per article.
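The per-model times above appear to be sums of the stage costs each model depends on. The following minimal sketch reproduces that arithmetic. The mapping from model to stages (`MODEL_STAGES`) is an assumption inferred from the model names and the reported totals, and the names `STAGE_MS`, `MODEL_STAGES` and `model_cost_ms` are illustrative only.

```python
# Average per-article cost of each extraction stage in milliseconds,
# taken from the stage list above.
STAGE_MS = {
    "references": 2.00,    # sources from references (complex extraction)
    "text_length": 0.68,   # article length in characters
    "views_total": 0.25,   # total page views in the month
    "views_median": 0.26,  # median of daily page views in the month
    "authors": 1.12,       # number of authors
}

# ASSUMPTION: which stages each model needs. Inferred from the model
# names and the reported totals; not stated explicitly in the text.
MODEL_STAGES = {
    "F": ("references",),
    "P": ("references", "views_total"),
    "PR": ("references", "views_total"),
    "Pm": ("references", "views_median"),
    "PmR": ("references", "views_median"),
    "PL": ("references", "views_total", "text_length"),
    "PmL": ("references", "views_median", "text_length"),
    "A": ("references", "authors"),
    "AR": ("references", "authors"),
    "AL": ("references", "authors", "text_length"),
}


def model_cost_ms(model: str) -> float:
    """Sum the per-article stage costs that a model depends on."""
    return round(sum(STAGE_MS[stage] for stage in MODEL_STAGES[model]), 2)


if __name__ == "__main__":
    for name in MODEL_STAGES:
        print(f"{name}: {model_cost_ms(name)} ms per article")
```

Under this additive assumption the Pm and PmR models come to 2.26 milliseconds per article.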

10.2. Limitations

Reliability, as one of the quality dimensions, is a subjective concept. Each person can have their own criteria to assess the reliability of a given source. Therefore, each Wikipedia language community can have its own definition of a reliable source. Only English Wikipedia, as the most developed edition of this free encyclopedia, provides an extended list of reliable and unreliable sources [43]. However, this list is not always followed: for example, although IMDb (Internet Movie Database) is marked as ‘Generally unreliable’, it is used very often (see Figure 4 or Table A1). As we observed, in some cases such sources can be used in references with some limitations: they can support some specific statements, but not all. Therefore, additional analysis of the placement of such sources in articles can help to find the limited areas in which they can be used.
In this study we proposed and used 10 models to assess the popularity and reliability of sources in Wikipedia. Each of the models uses some of the important measures related to content popularity and quality. However, there are other measures that have the potential to improve the presented approach. Therefore, we plan to extend the number of such measures in the models. We also plan to analyze the possibility of comparing the results with other approaches or lists of sources, for example, the most popular websites according to special tools, or reliable sources according to selected standards in some countries.
Each of the models can have its own weak and strong sides. For example, during the experiments we observed that some articles have overstated page view values in some languages in selected months. This can be deduced from other related measures of the article. Sources in such articles could get extra points. However, these were individual cases that did not significantly affect the results of the work. In future work we plan to provide additional algorithms to automatically find and reduce such cases.
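One way such cases might be flagged is to compare the monthly total of page views with what the median of daily page views would predict. The sketch below is not the algorithm planned by the authors. The comparison rule, the threshold of 5 and the function names are assumptions chosen only to make the idea concrete.

```python
def expected_views(median_daily_views: float, days: int) -> float:
    """Monthly views predicted by the daily median (floored at 1 to avoid division by zero)."""
    return max(median_daily_views * days, 1.0)


def is_overstated(total_views: float, median_daily_views: float, days: int,
                  ratio_threshold: float = 5.0) -> bool:
    """Flag a month whose total far exceeds the median-based expectation.

    ASSUMPTION: ratio_threshold=5.0 is a hypothetical cut-off, not a value
    taken from the study.
    """
    return total_views / expected_views(median_daily_views, days) > ratio_threshold


def capped_views(total_views: float, median_daily_views: float, days: int,
                 ratio_threshold: float = 5.0) -> float:
    """Reduce an overstated total to the largest value that would not be flagged."""
    cap = ratio_threshold * expected_views(median_daily_views, days)
    return min(float(total_views), cap)
```

A total that is dominated by a short burst of traffic is capped, while an ordinary month passes through unchanged.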
To extract the sources from references, we used Wikimedia dump files, which are usually published as of the first day of each month. We therefore have information only for a specific timestamp of each article, and we do not analyze on which day a source was inserted into (or deleted from) the Wikipedia article. If a source was inserted a few minutes (or seconds) before the process of creating the dump files started, we count it as if it had been present during the whole last considered month. The effect on the model can be even more negative if such a source was deleted a few minutes (or seconds) after the dump creation began. In other words, if a reference with the specified source was inserted and deleted around the timestamp of dump creation, it can slightly or strongly (depending on the values of the article measures) falsify the results of some of the models. Therefore, a more detailed analysis of each revision of the article can help to find how long a particular reference was present in the article.
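The revision-level analysis mentioned above could, for instance, measure how long a source was present in an article within a month. The following sketch assumes the revision history is already available as a time-ordered list of `(timestamp, set_of_sources)` pairs. This input format and the function name `presence_seconds` are hypothetical and are not part of the study's pipeline.

```python
from datetime import datetime


def presence_seconds(revisions, source, window_start, window_end):
    """Total time (in seconds) that `source` was present within the window.

    `revisions` is a list of (timestamp, set_of_sources) sorted by timestamp.
    Each revision is treated as valid until the next one (or the window end).
    """
    total = 0.0
    for i, (timestamp, sources) in enumerate(revisions):
        next_timestamp = revisions[i + 1][0] if i + 1 < len(revisions) else window_end
        start = max(timestamp, window_start)
        end = min(next_timestamp, window_end)
        if source in sources and end > start:
            total += (end - start).total_seconds()
    return total


if __name__ == "__main__":
    history = [
        (datetime(2020, 1, 1), {"variety.com"}),
        (datetime(2020, 1, 11), set()),            # source removed
        (datetime(2020, 1, 21), {"variety.com"}),  # source restored
    ]
    seconds = presence_seconds(history, "variety.com",
                               datetime(2020, 1, 1), datetime(2020, 1, 31))
    print(seconds / 86400, "days present")
```

Dividing the result by the window length gives a presence fraction that could weight a source's contribution instead of the all-or-nothing count taken at the dump timestamp.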

11. Conclusions and Future Work

In this paper we used basic and complex extraction methods to analyze over 200 million references in over 40 million articles from multilingual Wikipedia. We extracted information about the sources and unified them using special identifiers such as DOI, JSTOR, PMC, PMID, arXiv, ISBN, ISSN, OCLC and others. Additionally, we used information about archive URLs and templates included in the articles.
We proposed 10 models to assess the popularity and reliability of websites, news magazines and other sources in Wikipedia. We also used DBpedia and Wikidata to automatically identify the alignment of the sources to a specific field. Additionally, we analyzed the differences in popularity and reliability assessment of the sources between different periods. Moreover, we analyzed the growth leaders in each considered month. The results showed that, depending on the model and time, some sources can change in different directions (rise or fall) and with different strength. Next, we compared the similarity of the rankings produced by different models.
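The similarity of two rankings can be expressed with Spearman's rank correlation coefficient, which is the measure reported in the heatmap of Figure 5. The sketch below is a minimal illustration for rankings without ties. Restricting the comparison to sources present in both rankings is an assumption made here, since the handling of non-overlapping top lists is not described in this section.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman's rho between two rankings given as ordered lists of sources.

    ASSUMPTION: only sources present in both lists are compared, and they
    are re-ranked within that shared set. Each list must be free of duplicates.
    """
    in_b = set(ranking_b)
    common_a = [s for s in ranking_a if s in in_b]
    in_a = set(common_a)
    common_b = [s for s in ranking_b if s in in_a]
    n = len(common_a)
    if n < 2:
        raise ValueError("need at least two shared sources")
    position_b = {s: i for i, s in enumerate(common_b)}
    # Sum of squared rank differences over the shared sources.
    d_squared = sum((i - position_b[s]) ** 2 for i, s in enumerate(common_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    model_f = ["variety.com", "imdb.com", "who.int", "deadline.com"]
    model_p = ["variety.com", "imdb.com", "deadline.com", "who.int"]
    print(spearman_rho(model_f, model_p))
```

A value of 1 means both models order the shared sources identically, and a value of -1 means the orders are exactly reversed.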
Some extended results on the reliability assessment of sources in Wikipedia are available in the BestRef project [44].
In addition to what has already been described in Section 10.2, in future work we plan to extend the popularity and reliability models. One of the directions is to take into account the position of the inserted reference in the article and in the list of references. We also plan to take into account features of the articles related to Wikipedia authors, such as reputation or the number of article watchers.
In this work we showed how the growth of the popularity and reliability of sources can be measured based on differences in Wikipedia content across several recent months. In future research we plan to extend the time series to obtain more information about growth leaders in different years in each language version of Wikipedia.
Information about the reliability of sources can help to improve models for quality assessment of Wikipedia articles. This can be especially useful for assessing the sources of conflicting statements between language versions of Wikipedia in articles on the same subject. Additionally, one promising direction of future work is to create methods for suggesting reliable sources to Wikipedia authors for selected topics and statements in separate language versions of Wikipedia.

Author Contributions

Conceptualization, W.L. and K.W.; methodology, W.L.; software, W.L.; validation, W.L. and K.W.; formal analysis, K.W. and W.A.; investigation, W.L.; resources, W.A.; data curation, W.L.; writing–original draft preparation, W.L.; writing–review and editing, K.W. and W.A.; visualization, W.L.; supervision, K.W. and W.A.; project administration, K.W. and W.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Position in Local Rankings

Table A1. Position in the local rankings of the most popular and reliable sources in different language versions of Wikipedia in February 2020 using PR model. Source: own calculations based on Wikimedia dumps using complex extraction of references. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/a1.
Source | Language Version of Wikipedia
Source | ar | de | en | es | fa | fr | it | ja | nl | pl | pt | ru | sv | vi | zh
ad.nl416916663311,6636153108659712737319924003716113,152214212,739
adorocinema.com418917,03037311402-13,20417,889899016,592141215,00320,774575725,859
allocine.fr205139092921389012565176723231586148896351748184491
almaany.com323,568524927,303391459218,09821,354--10,374720992413,55232,987
appledaily.com.tw726024,734391731,35414,79443,41142,064840--410331,323-4262
cand.com.vn26,76880,00347,951--------75,342-318,821
deadline.com72112128115122051
dn.se23120731021742255765201131301223156121651882111091818
dwdl.de13865135919,6528051801271626,0425155457927,22132,027544811,97632,793
eiga.com271977454521609391920003130322,46415289262863546317433
elcinema.com123,353462838,2431744158525,52440,04512,26614,81735,23212,767734115,56326,656
expressen.se1392557300137982633894876097505545973883230111724
formulatv.com11211866795570532320259,42422,6955733248117125,33224,83732,378
hln.be20523577181717,37915,411147124,54855,13342069524117,06324,76340854307
ibge.gov.br-18,76113,2842115-19,876--7030-4427522,550290238,937
imdb.com24444713441248641513
infoescola.com14,81849,87217,542997-30,47611,193--7107544,20124,94555396575
irna.ir180666,843807220,057138,80366,34242,35017,815-16,45621,773-11,50317,543
kp.ru3177180987466253459241977933563548063413,0054591522361395
lenta.ru352325462930119248012547852363166134211578310676
lesinrocks.com19412308100416001399385960692301949738173074903238042401
mobot.org6862125,005433755211,2034969521010,80567342109537,40113,18693012,005
news.livedoor.com252931,8031628296711,697963213,4475-24,05710,329696528,94438898
news.mynavi.jp15225110139412,368426815,86516,9394-40,700388011,560718041045
nikkei.com319310966945571790385414022197740311524387012,83283664
oricon.co.jp2263606016768612134712606911312041115223
regeringen.se956612,561478921,114506568,510-64,85517,46833,056471125,8675301745,773
repubblica.it41320517326024031363118866234884540712211064466
research.amnh.org49,40049,86616,30413,141-28,28724,255-1410,29324,0653317-224,727
rottentomatoes.com16105918111950446971093014
scb.se336124877735188542800143916,38862123117591629312341739
skijumping.pl41,59458669,49316,664-25,73112,91962,18613,862351,61223,7635186-42,126
taz.de3959316485397785692382115,993191899613,190196822685773684
thefutoncritic.com13913019378741635233520405845825182
treccani.it33322327890234459128022292337523617868712809
trouw.nl93142869260242,7037579189933,55818,1855849113,87522,55724,77416,87027,600
tvbythenumbers.zap2it.com4522315336519119143751654912
tw.appledaily.com37,43723,16310,24553,429-58,799-1793-37,81061,70823,742-20015
universalis.fr5327335259046465812235180251275348712727101211,75418,729
variety.com1012335414137331944
vnexpress.net13,31018,184650458,2717212997239,942963919,41730,01828,70713,48612,17819857
volkskrant.nl2766918949687333451781377510,507212,1975107268716,051464415,292
web.archive.org43635212182412111718141057
who.int1113671352631322938286328610
Table A2. Position in local rankings of periodical sources in different language versions of Wikipedia in February 2020 using PR model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify type of the source. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/a2.
Source | Position in Local Rankings in Language Versions of Wikipedia
Source | ar | de | en | es | fa | fr | it | ja | nl | pl | pt | ru | sv | vi | zh
20 minutos17618618929587812656411946252333232262
Aftonbladet165715049701484546132111710139817081111127010-1369
Al-Ittihad10153029722731537--1397---1672--1290
Algemeen Dagblad387431917953731824381432184300538714202753
Aliqtisadi2-2022-669-23381138---1096---
Apple Daily562123364414068071768156173-15253081275-561
Auto, Motor und Sport11524535373-727376221275136-487585428-
China Press2227-14202356-1241-431--15672025-2548
DWDL.de16233611073471145270764362315111912963867671430
Dagens Industri133611339491682-2921572589-111462325312-1376
De Gelderlander92339710261774455628137010301045911041014-824637
De Morgen6822125085771572105934126254440830473355480
De Stentor13804181428--12232055194791333-18987956411696
De Volkskrant2931452835752212723304031669355299808373847
Die Tageszeitung37424144968013033253920611473223119766298
Donya-e-Eqtesad1272-266528052-2193----2833--1378
El Confidencial2432262193281579848525324383235321264190
El País21722440062051462634044808684260401309562
Ennahar newspaper5-2248-7271042---------
Entertainment Weekly862467713138461145
Exame7791474683610554117946679794911153191764185759
Expert1921085936117772913851501542749490205310-589926
Express Gazeta882941104513013021502119390779062117348-609824
Famitsu20041810503558101975570310-1599605693109362438
Finanztest22971404183621317049015651741006-1329909-316
Flight International3210237328445919701149871013725
Fokus5011538138012099599442054961-131676113156--
Folha de S. Paulo1191082652304621958306163481274871697-490834
Fortune293216452525543666412965103823
Gazeta do Povo14297451066385-104410111257--921156541123-
Helsingborgs Dagblad50580485796839171710495888537662793903154694
Het Laatste Nieuws214399430999836229109676231883619001162331341
Het Parool11493375505864274929333867166363017401116459538
Huffington Post France56953559940533463086012202403924515752531759
ISTOÉ8519971130668-1217919833112573289594956801106
Il Fatto Quotidiano3131262302115081474682765226353346636475663
Il Post5402075693326932183181536263299372436435440
Jeune Afrique392003422102124224463229425215364313276413
Komsomolskaya Pravda226187177418155120273131352523972350133140
la Repubblica6315455682291651104591731375962
La Tercera26948741776094991696953391311172511745379810
Le Figaro511285563321493527971784511343874704191009398
Le Monde1592443063001593246499248567325424639292351
Lenta.ru606714216292105166692242413911574398
Les Inrockuptibles2112872932331191129264222570283322547322228
NRC Next843344539111324868788410495707674837230445173
Nauka i Zhizn-25366104213711431506-104016352897-12051810
Nguoi Viet Daily News1126-1851--1064-------6858
Nihon Keizai Shimbun3221692065038142317712082791573686989014
Nikkei Business240913148982079-274712718--14102597-750306
Nishinippon Shimbun-129220921248-326617867---1576-1160115
O Estado de São Paulo89715861020590-61111441200-135251385728242829
PC Gamer515120301031555190172612531412
PC Games17858635936-10239081685563306849280441425534
Panorama565726506534885341101256-336734607-351425
People2556121381326146164193618
Pitchfork117287204015253736262028252655
Populär Historia129967118442420438251423021113-99818612187-1096
Rolling Stone762110131620152711142125283044
Rolling Stone Brasil1656231070969510638536379497561157103241018747905
Sai Gon Giai Phong71425092084-443184022171680-17978301535-3712
Sport Express3672953444903993133441794441004585626382514
Superinteressante17993132804991-7541720608-14686--446-
Svenska Dagbladet409199714951596-27897951158--8912379-414
Sydsvenskan495385566818705947004305475143055981782895
TV Guide6144175463465010359328391843854
TV Sorrisi e Canzoni1533866186971245369681805915126323357301312
TechCrunch71711914142792916122324511
Teknikens Värld-4081366-18114691357-11131607-11765-545
The Atlantic21251225724463231372442332035
The Daily Telegraph14128188161417189151617917
The Indian Express28845135992153901566187901474057
The New York Times15271416182238233438233821133
The Washington Post31331949182428252229181220
Time41191031120182210141513710
Tokyo Sports731198127040410205995272-8327691006-18119
Trouw7043345431687444279134658745267521053116310331276
USA Today69485101914151211208107
Variety111112248113422
Veja3565584421994793788168669695502619621394755
VnExpress92010189772021429745472371982127011717686711633
Vokrug sveta1906118313782121-8671055901865220-9-372687
Weekly Playboy1159-1581549-203613125-1344-1499-28931
Wired920131411189637131821291115
World Journal--714908--1307190-----10964
Wprost7416329458559081281127866598029305441004439795
Yomiuri Shimbun2731010372911563592565350113688281055-36743
¡Hola!18120418552891289122978911057207124273331

References

  1. Wikipedia Meta-Wiki. List of Wikipedias. Available online: https://meta.wikimedia.org/wiki/List_of_Wikipedias (accessed on 30 March 2020).
  2. English Wikipedia. Reliable Sources. Available online: https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources (accessed on 30 March 2020).
  3. Internet Live Stats. Total Number of Websites. Available online: https://www.internetlivestats.com/total-number-of-websites/ (accessed on 30 March 2020).
  4. Eysenbach, G.; Powell, J.; Kuss, O.; Sa, E.R. Empirical studies assessing the quality of health information for consumers on the world wide web: A systematic review. JAMA 2002, 287, 2691–2700.
  5. Price, R.; Shanks, G. A semiotic information quality framework: Development and comparative analysis. In Enacting Research Methods in Information Systems; Springer: Berlin, Germany, 2016; pp. 219–250.
  6. Xu, J.; Benbasat, I.; Cenfetelli, R.T. Integrating service quality with system and information quality: An empirical test in the e-service context. MIS Q. 2013, 37, 777–794.
  7. Nielsen, F.Å. Scientific citations in Wikipedia. arXiv 2007, arXiv:0705.2106.
  8. Lewoniewski, W.; Węcel, K.; Abramowicz, W. Analysis of references across Wikipedia languages. In Proceedings of the International Conference on Information and Software Technologies, Druskininkai, Lithuania, 12–14 October 2017; pp. 561–573.
  9. Characterizing Wikipedia Citation Usage. Analyzing Reading Sessions. Available online: https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Citation_Usage/Analyzing_Reading_Sessions (accessed on 29 February 2020).
  10. Jemielniak, D.; Masukume, G.; Wilamowski, M. The most influential medical journals according to Wikipedia: Quantitative analysis. J. Med. Internet Res. 2019, 21, e11429.
  11. Stvilia, B.; Twidale, M.B.; Smith, L.C.; Gasser, L. Assessing information quality of a community-based encyclopedia. Proc. ICIQ 2005, 5, 442–454.
  12. Blumenstock, J.E. Size matters: Word count as a measure of quality on Wikipedia. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China, 21–25 April 2008; pp. 1095–1096.
  13. Lucassen, T.; Schraagen, J.M. Trust in wikipedia: How users trust information from an unknown source. In Proceedings of the 4th Workshop on Information Credibility, Raleigh, NC, USA, 27 April 2010; pp. 19–26.
  14. Yaari, E.; Baruchson-Arbib, S.; Bar-Ilan, J. Information quality assessment of community generated content: A user study of Wikipedia. J. Inf. Sci. 2011, 37, 487–498.
  15. Conti, R.; Marzini, E.; Spognardi, A.; Matteucci, I.; Mori, P.; Petrocchi, M. Maturity assessment of Wikipedia medical articles. In Proceedings of the 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), New York, NY, USA, 27–29 May 2014; pp. 281–286.
  16. Piccardi, T.; Redi, M.; Colavizza, G.; West, R. Quantifying Engagement with Citations on Wikipedia. arXiv 2020, arXiv:2001.08614.
  17. Nielsen, F.Å.; Mietchen, D.; Willighagen, E. Scholia, scientometrics and wikidata. In Proceedings of the European Semantic Web Conference, Portorož, Slovenia, 28 May–1 June 2017; pp. 237–259.
  18. Teplitskiy, M.; Lu, G.; Duede, E. Amplifying the impact of open access: Wikipedia and the diffusion of science. J. Assoc. Inf. Sci. Technol. 2017, 68, 2116–2127.
  19. Fetahu, B.; Markert, K.; Nejdl, W.; Anand, A. Finding news citations for wikipedia. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 337–346.
  20. Ferschke, O.; Gurevych, I.; Rittberger, M. FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia. In Proceedings of the CLEF (Online Working Notes/Labs/Workshop), Rome, Italy, 17–20 September 2012; pp. 1–10.
  21. Flekova, L.; Ferschke, O.; Gurevych, I. What makes a good biography?: Multidimensional quality analysis based on wikipedia article feedback data. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Korea, 7–11 April 2014; pp. 855–866.
  22. Shen, A.; Qi, J.; Baldwin, T. A Hybrid Model for Quality Assessment of Wikipedia Articles. In Proceedings of the Australasian Language Technology Association Workshop, Brisbane, Australia, 6–8 December 2017; pp. 43–52.
  23. Di Sciascio, C.; Strohmaier, D.; Errecalde, M.; Veas, E. WikiLyzer: Interactive information quality assessment in Wikipedia. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, Limassol, Cyprus, 13–16 March 2017; pp. 377–388.
  24. Dang, Q.V.; Ignat, C.L. Measuring Quality of Collaboratively Edited Documents: The Case of Wikipedia. In Proceedings of the 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC), Pittsburgh, PA, USA, 31 October–3 November 2016; pp. 266–275.
  25. Lewoniewski, W.; Węcel, K.; Abramowicz, W. Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles. Informatics 2017, 4, 43.
  26. Lewoniewski, W.; Węcel, K.; Abramowicz, W. Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics. Computers 2019, 8, 60.
  27. Warncke-Wang, M.; Cosley, D.; Riedl, J. Tell Me More: An Actionable Quality Model for Wikipedia. In Proceedings of the WikiSym 2013, Hong Kong, China, 5–7 August 2013; pp. 1–10.
  28. Lih, A. Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluating collaborative media as a news resource. In Proceedings of the 5th International Symposium on Online Journalism, Austin, TX, USA, 16–17 April 2004; p. 31.
  29. Liu, J.; Ram, S. Using big data and network analysis to understand Wikipedia article quality. Data Knowl. Eng. 2018, 115, 80–93.
  30. Wilkinson, D.M.; Huberman, B.A. Cooperation and quality in wikipedia. In Proceedings of the 2007 International Symposium on Wikis WikiSym 07, Montreal, QC, Canada, 21–23 October 2007; pp. 157–164.
  31. Kane, G.C. A multimethod study of information quality in wiki collaboration. ACM Trans. Manag. Inf. Syst. (TMIS) 2011, 2, 4.
  32. WikiTop. Wikipedians Top. Available online: http://wikitop.org/ (accessed on 30 March 2020).
  33. Lewoniewski, W. The Method of Comparing and Enriching Information in Multilingual Wikis Based on the Analysis of Their Quality. Ph.D. Thesis, Poznań University of Economics and Business, Poznań, Poland, 2018.
  34. Lerner, J.; Lomi, A. Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE 2018, 13, e0190674.
  35. Wikimedia Downloads. English Wikipedia Latest Database Backup Dumps. Available online: https://dumps.wikimedia.org/enwiki/latest/ (accessed on 30 March 2020).
  36. English Wikipedia. 2019–2020 Coronavirus Pandemic. Available online: https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic (accessed on 30 March 2020).
  37. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85.
  38. Wikidata. Available online: https://www.wikidata.org/wiki/Wikidata:Main_Page (accessed on 23 April 2020).
  39. Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. DBpedia: A Nucleus for a Web of Open Data. In The Semantic Web; Aberer, K., Choi, K.S., Noy, N., Allemang, D., Lee, K.I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 722–735.
  40. DBpedia. Available online: https://wiki.dbpedia.org/ (accessed on 23 April 2020).
  41. Frey, J.; Hofer, M.; Obraczka, D.; Lehmann, J.; Hellmann, S. DBpedia FlexiFusion the Best of Wikipedia > Wikidata > Your Data. In Proceedings of the International Semantic Web Conference, Auckland, New Zealand, 26–30 October 2019; pp. 96–112.
  42. GFS Data Browser. Available online: https://global.dbpedia.org (accessed on 23 April 2020).
  43. English Wikipedia. Perennial Sources. Available online: https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources/Perennial_sources (accessed on 30 March 2020).
  44. BestRef. Popular and Reliable Sources of Wikipedia. Available online: https://bestref.net (accessed on 30 March 2020).
Figure 1. The most popular domains in over 280 million external links from 55 language versions of Wikipedia. Source: own calculations based on Wikimedia Dumps as of March 2020 using basic extraction method. The most popular domains in external links in other language versions are available on the web page: http://data.lewoniewski.info/sources/basic.
Figure 2. The most popular templates used in references in English Wikipedia. Source: own calculations based on Wikimedia Dumps as of March 2020. The most popular templates in other language versions are presented on the web page: http://data.lewoniewski.info/sources/templates.
Figure 3. Table with references in the Wikipedia article 2019–2020 coronavirus pandemic that was added using template 2019–2020 coronavirus pandemic data. Source [36].
Figure 4. The most popular domains in URL of references of Wikipedia articles in 55 language versions. Source: own calculations based on Wikimedia Dumps as of March 2020 using complex extraction method.
Figure 5. Spearman’s correlation coefficients between rankings of the top 100 most reliable sources in each language version of Wikipedia in February 2020 obtained by F-model in comparison with other models. Interactive version of the heatmap is available on the web page: http://data.lewoniewski.info/sources/heatmap/.
Figure 6. The most commonly used titles in publisher parameter of citations templates in the references of articles in English Wikipedia in March 2020. Source: own calculations based on Wikimedia dumps using complex extraction method.
Figure 7. The most commonly used titles in the journal parameter of citations templates in the references of articles in English Wikipedia in March 2020. Source: own calculations based on Wikimedia dumps using complex extraction method.
Table 1. Total number of articles, number of articles with a certain number of external links (URLs), total and unique number of external links in different language versions of Wikipedia. Source: own calculations based on Wikimedia dumps in March 2020 using complex extraction of references.
Language | Articles: all | Articles: with >=1 URL | Articles: >=10 URLs | Articles: >=100 URLs | URLs: all | URLs: unique
ar (Arabic) | 1,031,740 | 917,809 | 305,118 | 4369 | 9,443,788 | 7,599,390
az (Azerbaijani) | 156,442 | 109,743 | 20,299 | 237 | 674,212 | 512,465
be (Belarusian) | 185,753 | 150,116 | 21,067 | 299 | 1,142,005 | 958,165
bg (Bulgarian) | 260,081 | 211,031 | 27,806 | 185 | 1,174,324 | 1,030,715
ca (Catalan) | 638,664 | 600,711 | 336,302 | 1770 | 8,111,104 | 7,124,746
cs (Czech) | 447,120 | 377,647 | 69,821 | 1220 | 2,769,415 | 2,438,870
da (Danish) | 257,321 | 211,415 | 51,689 | 488 | 1,711,677 | 1,605,379
de (German) | 2,403,683 | 1,990,310 | 528,524 | 7849 | 17,646,882 | 15,632,584
el (Greek) | 174,589 | 151,008 | 43,664 | 891 | 1,479,933 | 1,254,224
en (English) | 6,029,201 | 5,500,527 | 1,963,703 | 60,384 | 69,554,575 | 56,030,670
eo (Esperanto) | 275,674 | 223,652 | 21,028 | 85 | 1,016,902 | 928,935
es (Spanish) | 1,528,811 | 1,395,107 | 484,650 | 5521 | 13,935,332 | 11,872,312
et (Estonian) | 206,430 | 136,651 | 8,344 | 146 | 526,292 | 466,916
eu (Basque) | 349,176 | 331,836 | 97,469 | 104 | 2,692,639 | 2,177,612
fa (Persian) | 712,216 | 656,161 | 52,779 | 1030 | 2,779,293 | 2,232,907
fi (Finnish) | 479,830 | 405,372 | 61,387 | 545 | 2,446,538 | 1,889,702
fr (French) | 2,185,885 | 1,830,876 | 593,874 | 7327 | 17,918,673 | 15,313,234
gl (Galician) | 161,860 | 127,395 | 52,159 | 595 | 1,483,541 | 1,315,467
he (Hebrew) | 261,209 | 213,989 | 76,274 | 347 | 2,152,942 | 1,987,360
hi (Hindi) | 140,327 | 97,706 | 10,102 | 370 | 563,963 | 379,306
hr (Croatian) | 198,670 | 137,949 | 10,796 | 155 | 587,017 | 449,783
hu (Hungarian) | 465,509 | 411,072 | 97,289 | 1179 | 3,231,880 | 2,796,234
hy (Armenian) | 264,676 | 219,045 | 50,681 | 1218 | 2,073,940 | 1,534,220
id (Indonesian) | 524,100 | 409,937 | 53,085 | 1267 | 2,496,158 | 2,158,397
it (Italian) | 1,586,855 | 1,374,018 | 403,171 | 3194 | 11,889,377 | 10,141,992
ja (Japanese) | 1,192,596 | 890,138 | 205,264 | 4210 | 7,449,642 | 6,309,830
ka (Georgian) | 135,333 | 102,910 | 10,508 | 239 | 533,019 | 420,322
kk (Kazakh) | 230,376 | 137,333 | 6,536 | 54 | 736,786 | 591,481
ko (Korean) | 486,067 | 318,190 | 63,425 | 1110 | 2,197,777 | 1,990,960
la (Latin) | 132,258 | 106,887 | 3,592 | 22 | 347,131 | 287,532
lt (Lithuanian) | 196,606 | 136,982 | 4,238 | 27 | 390,006 | 331,424
ms (Malay) | 335,222 | 191,206 | 18,288 | 431 | 868,166 | 716,712
nl (Dutch) | 1,999,092 | 1,626,602 | 31,700 | 1460 | 4,303,813 | 3,295,204
nn (Norwegian (Nynorsk)) | 151,857 | 126,229 | 16,642 | 73 | 624,568 | 561,283
no (Norwegian) | 529,426 | 466,557 | 132,817 | 672 | 3,812,791 | 3,410,905
pl (Polish) | 1,387,164 | 1,177,588 | 159,956 | 2334 | 6,962,407 | 5,673,526
pt (Portuguese) | 1,022,524 | 925,771 | 186,889 | 4454 | 7,836,416 | 6,583,420
ro (Romanian) | 404,748 | 352,338 | 80,111 | 970 | 2,742,321 | 2,375,095
ru (Russian) | 1,602,761 | 1,333,264 | 527,323 | 8184 | 16,116,795 | 12,370,583
sh (Serbo-Croatian) | 451,298 | 383,945 | 223,652 | 292 | 4,464,569 | 1,118,996
simple (Simple English) | 155,887 | 103,886 | 10,990 | 264 | 548,488 | 480,654
sk (Slovak) | 232,551 | 176,188 | 10,893 | 268 | 823,474 | 681,781
sl (Slovenian) | 167,119 | 135,614 | 21,910 | 219 | 786,235 | 710,113
sr (Serbian) | 630,870 | 552,584 | 53,185 | 761 | 3,502,213 | 1,959,054
sv (Swedish) | 3,740,411 | 3,590,906 | 798,561 | 2356 | 21,372,068 | 11,686,205
ta (Tamil) | 132,424 | 105,186 | 10,658 | 228 | 569,482 | 401,066
th (Thai) | 135,627 | 93,945 | 16,965 | 726 | 758,451 | 667,308
tr (Turkish) | 343,216 | 257,976 | 40,305 | 1306 | 1,762,805 | 1,495,178
uk (Ukrainian) | 994,030 | 859,711 | 185,470 | 2476 | 6,973,455 | 5,195,088
ur (Urdu) | 154,282 | 120,189 | 5229 | 191 | 403,727 | 354,010
uz (Uzbek) | 133,774 | 92,369 | 964 | 27 | 299,080 | 265,877
vi (Vietnamese) | 1,241,487 | 1,178,177 | 46,835 | 1580 | 3,604,033 | 2,846,271
vo (Volapük) | 124,189 | 93,924 | 9 | - | 104,201 | 103,660
zh (Chinese) | 1,099,744 | 862,260 | 175,496 | 4873 | 6,757,646 | 5,779,801
zh-min-nan (Min Nan) | 267,615 | 192,933 | 519 | 1 | 353,098 | 274,056
Table 2. Total and unique number of references with special identifiers: DOI, ISBN, ISSN, PMID, PMC. Source: own calculations based on Wikimedia dumps as of March 2020 using complex extraction of references.
Language | DOI | ISBN | ISSN | PMID | PMC
All | Unique | All | Unique | All | Unique | All | Unique | All | Unique
ar (Arabic)130,24687,431169,58378,65524,038771883,22858,09718,37412,793
az (Azerbaijani)2290132023,3038823903260540383128107
be (Belarusian)2494156848,314642610492881120735165111
bg (Bulgarian)8431582353,73814,536134550360243745989700
ca (Catalan)49,81934,451226,67776,50825,939687827,67821,79678296343
cs (Czech)26,41315,891177,25233,40228,785465912,27177951318925
da (Danish)7440461932,22313,041152254048792859892556
de (German)158,39982,168890,727199,94977,06513,25018,89312,82114,6609284
el (Greek)22,80314,41666,98227,2924751154112,325775825091647
en (English)2,130,154919,4804,374,241848,284550,83439,487993,092477,883346,934156,941
eo (Esperanto)4806317718,128946468733219221249565340
es (Spanish)136,76177,866653,902168,30697,68814,20165,49939,32814,8678457
et (Estonian)7481326916,6505148534171480920631134509
eu (Basque)7136511517,1599413681122021511119059383633
fa (Persian)27,02518,63145,84920,9085212214617,18011,37144993002
fi (Finnish)10,1515394177,95225,0857991195451822936370276
fr (French)164,24473,634954,903201,227125,84215,59240,17726,03542942307
gl (Galician)45,23130,38568,55821,6786351193634,90724,79610,2757092
he (Hebrew)8611775112,9539661159941838833632590546
hi (Hindi)10,907698827,21212,30010114399168621314421019
hr (Croatian)5570331820,663805390329248722726726471
hu (Hungarian)21,99814,47365,54820,5964245152812,139815420901466
hy (Armenian)36,91822,19259,67925,8576585234832,28219,08476284568
id (Indonesian)36,81921,546139,96946,1219416245516,64510,74339572709
it (Italian)90,20751,150423,66895,21413,880342245,39432,05270304642
ja (Japanese)119,91458,187573,91892,36229,754556742,79327,13011,2256598
ka (Georgian)3592254115,4256310122833913941084297238
kk (Kazakh)37531255,956134667392101787863
ko (Korean)40,52920,76162,38423,8046355164014,499937438882453
la (Latin)7005212870186049362942457455
lt (Lithuanian)1456108310,8513597316138940655187144
ms (Malay)11,423768130,83814,5831931694567839311416945
nl (Dutch)12,669853845,58816,29617828217496533914701036
nn (Nynorsk)3412178919,9036649611180770479165108
no (Norwegian)11,868652169,35025,9327054139149243169734510
pl (Polish)131,70442,238519,93462,97474,490911149,27428,00269443716
pt (Portuguese)84,66445,575263,77481,58334,965702933,82620,82571234466
ro (Romanian)18,71511,59262,05722,4483114104810,780648822601507
ru (Russian)133,38863,725639,602131,76968,76511,25937,71623,11973284422
sh (Serbo-Croatian)53,96512,92244,87511,669337465729,01221,92732272225
simple (Simple Engl.)7730533726,29913,472210361242652953908668
sk (Slovak)3166223840,302891470471131735569127106
sl (Slovenian)12,819794344,21312,27315796749541599416891104
sr (Serbian)67,07922,11594,35229,5766807200735,54126,30850113336
sv (Swedish)863,3378954145,17727,16911,5012399660536941336817
ta (Tamil)19,67914,00628,47015,720171476211,131816420881386
th (Thai)26,44916,16232,28814,812287993718,95911,73042512681
tr (Turkish)18,68111,11854,77522,021349110279360602117391202
uk (Ukrainian)255,14424,659122,99937,64453,699334955,24910,14332242216
ur (Urdu)148189783894735362138546379157106
uz (Uzbek)144126870542251926241110
vi (Vietnamese)71,16239,745139,02744,89610,740277332,72921,70186745665
vo (Volapük)--8777------
zh (Chinese)109,03459,455362,31092,42225,637634948,21529,79111,0776883
zh-min-nan (Min Nan)290163618262201162512015
Table 3. Total and unique number of references with special identifiers: arXiv, Bibcode, JSTOR, LCCN, OCLC. Source: own calculations based on Wikimedia dumps as of March 2020 using complex extraction of references.
Language | arXiv | Bibcode | JSTOR | LCCN | OCLC
All | Unique | All | Unique | All | Unique | All | Unique | All | Unique
ar (Arabic)8604301621,12999435123369342528783705091
az (Azerbaijani)1447279739247414118375504174
be (Belarusian)2531295473185238438041
bg (Bulgarian)4043091395108929822710447799344
ca (Catalan)1735911556233521641102519010140181543
cs (Czech)143658038171713207139241774252516
da (Danish)1618575554124617097401254399
de (German)64303318758635913789206026611646332516
el (Greek)182987152622963970523140571477631
en (English)154,57928,727396,409117,983169,41971,44722,3744921312,75579,862
eo (Esperanto)39202411793562532221199155
es (Spanish)2914165312,18872526713385867126760,59714,105
et (Estonian)3201321355597134689514876
eu (Basque)18551387165534643173118
fa (Persian)8985333200213367952995422005875
fi (Finnish)110894603451641043828133100
fr (French)11,448304323,5137147534527557653261677,03723,267
gl (Galician)83134038942323152484167820758101411
he (Hebrew)70683443152452269915991219
hi (Hindi)106327321897752221575533619349
hr (Croatian)16612468852111780144396162
hu (Hungarian)3572431602116462046159421282481
hy (Armenian)4482553666168176255086381905849
id (Indonesian)2314819778436382405119848916010,0392717
it (Italian)284612915860361019161138241968320,1147209
ja (Japanese)11,253307533,24594532755154381123412,1693876
ka (Georgian)42526911438021571158252465250
kk (Kazakh)362055501714--2016
ko (Korean)7621256515,160551712818373741141529623
la (Latin)4452454734664430
lt (Lithuanian)1227919614747381110549
ms (Malay)6573742386157052836057432337646
nl (Dutch)3528317261163123225336271
nn (Norwegian (Nynorsk))749169164957019598195223123
no (Norwegian)9752523027125839224943261547611
pl (Polish)24938635414221512616352237928,1957579
pt (Portuguese)4260166619,60260132844180632117011,5144891
ro (Romanian)118149540042270681480175852008800
ru (Russian)12,622330125,75477562358128836813149881772
sh (Serbo-Croatian)17191110172040129537173435787
simple (Simple English)544265122282522717746261246404
sk (Slovak)1981313982912417103334160
sl (Slovenian)66418714736762982614016504330
sr (Serbian)637415298220729757181298742211914
sv (Swedish)1042391309712573112231982251051629
ta (Tamil)699306262516635473728443895475
th (Thai)1053340285915064923263726997507
tr (Turkish)215076952822395701380107591550588
uk (Ukrainian)4327175414,62852439436602148930111450
ur (Urdu)933320812714110497385158
uz (Uzbek)24209379651641411
vi (Vietnamese)7798264418,88178952568146234220867791895
vo (Volapük)----------
zh (Chinese)11,497349627,19910,4592819162358925512,3603482
zh-min-nan (Min Nan)11824397--99
Table 4. Identifiers that were used for URL unification of references.
| Identifier | Description | URL |
|---|---|---|
| arXiv | arXiv repository identifier | https://arxiv.org/abs/ |
| Bibcode | Compact identifier used by several astronomical data systems | https://adsabs.harvard.edu/abs/ |
| DOI | Digital object identifier | https://doi.org/... |
| ISBN | International Standard Book Number | https://books.google.com/books?vid=ISBN |
| ISSN | International Standard Serial Number | https://worldcat.org/ISSN/ |
| JSTOR | Journal Storage number | https://jstor.org/stable/ |
| LCCN | Library of Congress Control Number | https://lccn.loc.gov/ |
| PMC | PubMed Central | https://ncbi.nlm.nih.gov/pmc/articles/PMC |
| PMID | PubMed | https://ncbi.nlm.nih.gov/pubmed/ |
| OCLC | WorldCat’s Online Computer Library Center | https://worldcat.org/oclc/ |
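The unification idea behind Table 4 can be illustrated with a minimal sketch: a reference that carries a known identifier is mapped to one canonical URL by appending the identifier value to the prefix listed in the table. The function name `unify_reference`, the lowercase parameter names, and the precedence among identifiers (dictionary order) are assumptions made for illustration only; the article does not specify them.

```python
# URL prefixes taken from Table 4. Everything else here (function name,
# lowercase parameter keys, precedence order) is a hypothetical illustration.
URL_PREFIXES = {
    "arxiv": "https://arxiv.org/abs/",
    "bibcode": "https://adsabs.harvard.edu/abs/",
    "doi": "https://doi.org/",
    "isbn": "https://books.google.com/books?vid=ISBN",
    "issn": "https://worldcat.org/ISSN/",
    "jstor": "https://jstor.org/stable/",
    "lccn": "https://lccn.loc.gov/",
    "pmc": "https://ncbi.nlm.nih.gov/pmc/articles/PMC",
    "pmid": "https://ncbi.nlm.nih.gov/pubmed/",
    "oclc": "https://worldcat.org/oclc/",
}


def unify_reference(params):
    """Return a canonical URL for the first known identifier found, else None.

    `params` is assumed to be a dict of citation-template parameters with
    lowercase keys, e.g. {"doi": "10.3390/info11050263"}.
    """
    for name, prefix in URL_PREFIXES.items():
        value = (params.get(name) or "").strip()
        if value:
            return prefix + value
    return None


if __name__ == "__main__":
    print(unify_reference({"title": "Some paper", "doi": "10.3390/info11050263"}))
```

With such a mapping, two references that cite the same work through different templates or in different language versions resolve to the same URL, which is what makes counting unique sources possible.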
Table 5. Total number of articles, number of articles with at least 1 reference, at least 10 references, and at least 100 references, and the total and unique numbers of references in each considered language version of Wikipedia. Source: own calculations based on Wikimedia dumps as of March 2020 using complex extraction of references.
Language | Number of Articles | Number of References
All | with >= 1 ref. | with >= 10 refs. | with >= 100 refs. | All | Unique
ar (Arabic)1,031,740817,48558,30325883,598,6912,138,127
az (Azerbaijani)156,44277,2136476440430,655210,186
be (Belarusian)185,75390,4275897269352,275163,649
bg (Bulgarian)260,081152,63213,099330702,747397,568
ca (Catalan)638,664421,09655,87014432,676,8701,334,484
cs (Czech)447,120229,70045,79312671,762,136911,167
da (Danish)257,32199,18813,157615614,575395,741
de (German)2,403,6831,350,469276,204621410,343,1006,150,128
el (Greek)174,589100,64524,0801000971,438589,234
en (English)6,029,2014,738,5261,363,47567,17958,914,06228,973,680
eo (Esperanto)275,67454,8394091149230,042152,878
es (Spanish)1,528,8111,078,622233,77454,96314,428,5144,495,443
et (Estonian)206,43090,62811,709392568,263258,665
eu (Basque)349,176157,6794045116563,102172,629
fa (Persian)712,216383,13123,18311111,393,976840,009
fi (Finnish)479,830340,42565,71414642,514,6371,198,430
fr (French)2,185,8851,290,227314,89312,30312,407,7096,477,543
gl (Galician)161,86073,04012,476560560,381297,875
he (Hebrew)261,209126,06324,712360895,644777,279
hi (Hindi)140,32755,1736354403331,919203,538
hr (Croatian)198,670100,7498457313463,336246,391
hu (Hungarian)465,509174,54741,58512921,433,477817,002
hy (Armenian)264,676182,06519,299937984,768528,465
id (Indonesian)524,100226,67333,69115421,525,411845,109
it (Italian)1,586,855698,996143,03454065,895,5163,273,847
ja (Japanese)1,192,596694,366206,82210,2298,701,3854,485,637
ka (Georgian)135,33346,3084945290265,153160,032
kk (Kazakh)230,376144,401101148274,52952,503
ko (Korean)486,067170,64624,46710061,136,561725,725
la (Latin)132,25845,476156327128,99266,105
lt (Lithuanian)196,60668,043322148212,662143,521
ms (Malay)335,22276,84510,534469487,718311,772
nl (Dutch)1,999,092956,91827,7686192,082,3681,198,126
nn (Norwegian (Nynorsk))151,85744,1914588126220,340125,740
no (Norwegian)529,426253,18323,9329531,243,303691,525
pl (Polish)1,387,164802,519160,59941686,035,3452,467,049
pt (Portuguese)1,022,524728,219103,34454804,944,3212,710,271
ro (Romanian)404,748232,24832,52712951,481,560625,841
ru (Russian)1,602,761978,601219,13580838,857,3264,610,614
sh (Serbo-Croatian)451,298338,34015,4513931,322,980214,925
simple (Simple English)155,88781,6918799311431,401274,407
sk (Slovak)232,55189,3458027228411,430224,896
sl (Slovenian)167,11964,2107717343367,513197,430
sr (Serbian)630,870494,94618,8708162,789,499487,172
sv (Swedish)3,740,4113,123,685135,22833,49720,053,4934,207,630
ta (Tamil)132,42491,0238981280490,602255,568
th (Thai)135,62769,95412,634642581,563362,812
tr (Turkish)343,216163,28722,47210911,121,121690,471
uk (Ukrainian)994,030579,40781,90816813,894,4371,417,597
ur (Urdu)154,282114,6663214185259,328194,444
uz (Uzbek)133,77425,0825853155,67323,288
vi (Vietnamese)1,241,4871,053,26641,64018792,747,7811,602,977
vo (Volapük)124,1896559-15251374
zh (Chinese)1,099,744630,774112,95352875,009,9842,740,728
zh-min-nan (Min Nan)267,61540,194161261,8964898
Table 6. Spearman’s correlation coefficients between rankings of the top 100 most popular and reliable sources in multilingual Wikipedia in February 2020 using different models.
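| Models | F | P | PR | PL | Pm | PmR | PmL | A | AR | AL |
|---|---|---|---|---|---|---|---|---|---|---|
| F | 1.00 | 0.37 | 0.50 | 0.47 | 0.38 | 0.49 | 0.44 | 0.62 | 0.78 | 0.80 |
| P | 0.37 | 1.00 | 0.87 | 0.91 | 0.99 | 0.87 | 0.89 | 0.81 | 0.41 | 0.53 |
| PR | 0.50 | 0.87 | 1.00 | 0.98 | 0.87 | 1.00 | 0.94 | 0.82 | 0.61 | 0.68 |
| PL | 0.47 | 0.91 | 0.98 | 1.00 | 0.92 | 0.97 | 0.97 | 0.83 | 0.53 | 0.66 |
| Pm | 0.38 | 0.99 | 0.87 | 0.92 | 1.00 | 0.88 | 0.90 | 0.82 | 0.42 | 0.55 |
| PmR | 0.49 | 0.87 | 1.00 | 0.97 | 0.88 | 1.00 | 0.95 | 0.83 | 0.59 | 0.67 |
| PmL | 0.44 | 0.89 | 0.94 | 0.97 | 0.90 | 0.95 | 1.00 | 0.81 | 0.49 | 0.65 |
| A | 0.62 | 0.81 | 0.82 | 0.83 | 0.82 | 0.83 | 0.81 | 1.00 | 0.68 | 0.79 |
| AR | 0.78 | 0.41 | 0.61 | 0.53 | 0.42 | 0.59 | 0.49 | 0.68 | 1.00 | 0.92 |
| AL | 0.80 | 0.53 | 0.68 | 0.66 | 0.55 | 0.67 | 0.65 | 0.79 | 0.92 | 1.00 |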
Table 7. Spearman’s correlation coefficients between rankings of the top 10,000 most popular and reliable sources in multilingual Wikipedia in February 2020 using different models.
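| Models | F | P | PR | PL | Pm | PmR | PmL | A | AR | AL |
|---|---|---|---|---|---|---|---|---|---|---|
| F | 1.00 | 0.38 | 0.51 | 0.50 | 0.38 | 0.49 | 0.46 | 0.68 | 0.80 | 0.84 |
| P | 0.38 | 1.00 | 0.67 | 0.79 | 0.99 | 0.67 | 0.78 | 0.72 | 0.37 | 0.47 |
| PR | 0.51 | 0.67 | 1.00 | 0.89 | 0.66 | 0.98 | 0.85 | 0.56 | 0.65 | 0.62 |
| PL | 0.50 | 0.79 | 0.89 | 1.00 | 0.78 | 0.88 | 0.96 | 0.59 | 0.51 | 0.62 |
| Pm | 0.38 | 0.99 | 0.66 | 0.78 | 1.00 | 0.69 | 0.79 | 0.73 | 0.37 | 0.48 |
| PmR | 0.49 | 0.67 | 0.98 | 0.88 | 0.69 | 1.00 | 0.88 | 0.57 | 0.64 | 0.62 |
| PmL | 0.46 | 0.78 | 0.85 | 0.96 | 0.79 | 0.88 | 1.00 | 0.59 | 0.49 | 0.62 |
| A | 0.68 | 0.72 | 0.56 | 0.59 | 0.73 | 0.57 | 0.59 | 1.00 | 0.72 | 0.81 |
| AR | 0.80 | 0.37 | 0.65 | 0.51 | 0.37 | 0.64 | 0.49 | 0.72 | 1.00 | 0.91 |
| AL | 0.84 | 0.47 | 0.62 | 0.62 | 0.48 | 0.62 | 0.62 | 0.81 | 0.91 | 1.00 |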
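As a reading aid for the correlation tables, the following minimal sketch shows how Spearman's coefficient between two rankings of the same sources can be computed. It assumes that both models rank an identical set of sources and that there are no ties, in which case the closed form rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies. The helper names and the toy positions are hypothetical and are not taken from the article's data; how the authors treated sources missing from one of the two top lists is not stated in this section.

```python
def to_ranks(positions):
    """Re-rank sources 1..n by their position (smaller position = better rank)."""
    ordered = sorted(positions, key=positions.get)
    return {source: rank for rank, source in enumerate(ordered, start=1)}


def spearman_rho(positions_a, positions_b):
    """Spearman's rho for two rankings of the same sources, assuming no ties."""
    if set(positions_a) != set(positions_b):
        raise ValueError("both rankings must cover the same sources")
    n = len(positions_a)
    if n < 2:
        raise ValueError("need at least two sources")
    ranks_a = to_ranks(positions_a)
    ranks_b = to_ranks(positions_b)
    # Sum of squared rank differences over all shared sources.
    d2 = sum((ranks_a[s] - ranks_b[s]) ** 2 for s in ranks_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # Toy positions of four hypothetical sources under two hypothetical models.
    model_x = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}
    model_y = {"s1": 2, "s2": 1, "s3": 3, "s4": 4}
    print(spearman_rho(model_x, model_y))
```

A value near 1.00, as between the P and Pm models, means two models order the sources almost identically, while lower values, as between F and P, indicate that the models reward different properties. For data with ties, a library routine such as `scipy.stats.spearmanr` would be the more appropriate choice.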
Table 8. Position in rankings of publishers in English Wikipedia depending on popularity and reliability model in February 2020. Source: own calculation based on Wikimedia dumps using complex extraction and using only values from publisher parameter of citation templates in references. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table8.
Source | Position in the Ranking Depending on Model
F | P | PR | PL | Pm | PmR | PmL | A | AR | AL
ABC News71185439205743348361
ABC-CLIO36202523182422252324
AllMusic8288826891467
Anime News Network27341215321415402030
BBC3433543233
BBC News10575675788
BBC Sport4111512161713575
Cambridge University Press5322322344
Canadian Online Explorer1198519912784198124136629
CBS Interactive2091079108121510
CNN2229629661612
CRC Press53632027582025634045
Cricketarchive148234604459807366552107479
ESPN138171481916101314
Harpercollins121175236144832286555
Hung Medien18383421333221192620
IGN32372924342923221715
IMDB65551617651619893447
International Skating Union1534025620938232225214110869
John Wiley & Sons41261316191214302525
Macmillan63224338174135295152
Metacritic52492120512320573731
Microsoft1137141371310374646
MTV78194631215031173626
National Center For Education Statistics66232634945224313964952261117
National Park Service7943848894758601211
Official Charts Company16302418312618182118
Oxford University Press2111111121
Prometheus Global Media11362722373027201816
Routledge6644434456
Simon & Schuster75142625112726234541
Springer1912691057161013
The Hindu72183182616315241855665
United States Census Bureau12751124612912
United States Geological Survey1217078951598098741421
University of California Press24161919121817151922
Wikimedia Foundation5541101459247379251113830082520
WWE9221624022644282419
Yale University Press9132828132829213328
YouTube171511101511111199
Table 9. Position in rankings of journals in English Wikipedia depending on popularity and reliability model in February 2020. Source: own calculation based on Wikimedia dumps using complex extraction and using only values from journal parameter in citation templates in references. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table9.
Source | Position in the Ranking Depending on Model
F | P | PR | PL | Pm | PmR | PmL | A | AR | AL
American Family Physician84364220253817204719
Astronomy & Astrophysics12565639455738383416
Astronomy and Astrophysics23125112225121274
Astronomy Letters19208518225282311228172247320743
Billboard916891278657
BMJ36141212101111101817
Cell16321415201213281923
Communications of the ACM18829343817361195499
Emory Law Journal8049111147737480302237885736978
Icarus14213827163625112014
JAMA54251917182016153326
Journal of The American Chemical Society30792129521827612838
Journal of Virology120331824241923233203199
Lancet233753544119
Lloyd’s List5127856473196299211528828159847356
LPSN1747576091375978118725918209436
Mammalian Species56776742586639313620
MIT Technology Review55655574119209132120939003338
Molecular Phylogenetics and Evolution341014148944346471721
Monthly Notices of The Royal Astronomical Society730261921262118138
Myconet63215066401106341912978413434071537
Nature1111111111
Nature News885201108548228200406410522
New England Journal of Medicine60192216132115344645
Pediatrics62384335284032163928
Physical Review Letters26351125231020251424
PLOS ONE6443433346
Proceedings of the National Academy of Sciences181591011899915
Proceedings of the National Academy of Sciences of the United States of America138587478812
Rolling Stone55181314151314131618
Science3222222222
The Astronomical Journal8425833356334242111
The Astrophysical Journal4766665765
The Cochrane Database of Systematic Reviews27610759651210
The Guardian184176851311028697127125
The IUCN Red List of Threatened Species102613438275586211533
The Journal of American History8059866426188158282599698
The Journal of Biological Chemistry15571723411419542329
The Lancet3812231892318193833
The New England Journal of Medicine4810161381510143722
Time64132026142426172527
Variety86341522271624372635
Wired141223028173028265251
Zookeys206491931727342952893624242
Zootaxa1115359561537870411013
Table 10. Position in the global ranking of the most popular and reliable sources with identified title in 55 considered language versions of Wikipedia depending on the model in February 2020. Source: own calculations based on Wikimedia dumps using complex extraction of references. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table10.
Source | Model
F | P | PR | PL | Pm | PmR | PmL | A | AR | AL
American Museum of Natural History19604868594669415880698445979
CBS News42133333133635234944
CNN14717157161741415
Collider55162725152722396557
Deadline Hollywood1111111133
Entertainment Weekly1255655651314
Forbes208108897152017
GameSpot11192414142414111512
IndieWire8115161719192182109102
Internet Movie Database421352134611
MTV2118292917292971618
Newspapers.com530152033202317117
Official Charts10312022281718131210
Oricon9117411751285
People53171211201112222323
Pitchfork15292318272116211816
Rotten Tomatoes171067106818913
Soccerway710040521165060321011
TV by the Numbers2343343368
TVLine43261827241526454752
TechCrunch34202613162211385232
The Atlantic48123534183737335346
The Daily Telegraph28142121122520203125
The Futon Critic18361930351830272736
The Indian Express31371416361315252622
The Washington Post13499410981719
Time299221992319193327
USA Today166111061210101920
Variety3222222222
Wayback Machine83813243714251656
WordPress.com63381231813944
Table 11. The most popular and reliable types of the sources in selected language versions of Wikipedia in February 2020 based on PR model. Source: own calculations based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify type of the source. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table11.
Source Type | Language Version of Wikipedia
ar | de | en | es | fa | fr | it | ja | nl | pl | pt | ru | sv | vi | zh
archive1256391227302421363138362158
business735526937325533
daily newspaper94461082411654699
enterprise14678610869786754
film database210109735513248181710
government agency2575516045259457124626046256
holding company1352526213319411515224719998141391357
magazine822257474532846
morning paper1642452215444453875016444175054825402381504
natural history museum5615833915798004054427921947841451055610523
news agency401134965361567211410499661245453
news website21126413971517206742155
newspaper3837923951379972
online database4131214115134112101512171626
online newspaper1826131024202323233325337212
open-access publisher17182620181922302625193226817
organization11991184111010991110613
periodical37515322111236642213121234
public broadcasting668036775278821887787931124580
review aggregator6151617151416251582119202427
social cataloging application51414151413144814122018192330
social networking service33302229262729162814353159308
specialty channel102311221718183421151717281818
television network537353345387712582871484213779
television station20161727202219816212914541920
website111111112111111
weekly magazine26118161612102418231010411311
written work12325616710443064652914115578164418263519
Table 12. The most popular periodical sources in Wikipedia articles from 55 language versions using different popularity and reliability models in February 2020. Source: own calculations based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify type of the source. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table12.
Source | Models
F | P | PR | PL | Pm | PmR | PmL | A | AR | AL
Entertainment Weekly2322322222
Flight International20192522172220172620
Fortune36151717151716253628
Komsomolskaya Pravda21362428372329312026
Lenta.ru621131619131714911
New York Post27182118162021192421
Nihon Keizai Shimbun14271613261613241615
PC Gamer28252220242118263530
People18855946867
Pitchfork413981278743
Rolling Stone16111011101111101516
Spin26293030303031202922
TV Guide33281821271922291923
TechCrunch119116785161713
Technology Review10716484128615295118116
The Atlantic176141481515131818
The Daily Telegraph77710610106108
The Express Tribune24422826402625272319
The Globe and Mail10221919201819151214
The Indian Express914671467976
The Japan Times42233232183235454343
The New York Times12121515131414121312
The Wall Street Journal29202725222827233127
The Washington Post3233233334
Time85895995119
USA Today5444454455
Ukrayinska Pravda19617668617672354942
Variety1111111111
Wired13101212111212111410
la Repubblica1517202421243018817
Table 13. Position in rankings of popular and reliable sources depending on period in all considered language versions of Wikipedia using PR model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify title of the sources. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table13.
Sources | Months
December 2019 | January 2020 | February 2020 | March 2020
CNN18201617
Deadline Hollywood1111
Entertainment Weekly5555
Forbes99910
GameSpot17162224
IndieWire24172016
Internet Movie Database4343
Newspapers.com19181815
Official Charts15192120
Oricon7777
People12101112
Rotten Tomatoes6666
TV by the Numbers3434
TVLine14151418
The Daily Telegraph20211721
The Futon Critic21231919
The Indian Express16121514
The Washington Post1114129
USA Today13111011
Variety2222
Wayback Machine10131313
WordPress.com8888
Table 14. Position in rankings of popular and reliable sources depending on period in all considered language versions using PR model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify type of the source. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table14.
Sources | Months
December 2019 | January 2020 | February 2020 | March 2020
Apple Daily29313035
Empire32293333
Entertainment Weekly2222
Flight International23242025
Fortune17191917
GamesMaster28282929
Komsomolskaya Pravda21222324
la Repubblica25252520
Lenta.ru11121213
Metro24232623
New York Post20212121
Nihon Keizai Shimbun15151616
PC Gamer22202422
People4345
Pitchfork8999
Radio Times26262226
Rolling Stone12111010
Spin30323230
TV Guide19171818
TechCrunch9101111
The Atlantic16161514
The Daily Telegraph7777
The Express Tribune37302728
The Globe and Mail18181719
The Indian Express6566
The New York Times14141415
The Wall Street Journal27272827
The Washington Post3653
Time10888
USA Today5434
Variety1111
Wired13131312
Table 15. Position of the periodical sources in growth ranking in selected language versions of Wikipedia and period of time using F model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia). Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table15.
Source | German Wikipedia (de) | French Wikipedia (fr) | Russian Wikipedia (ru)
December 2019 | January 2020 | February 2020 | December 2019 | January 2020 | February 2020 | December 2019 | January 2020 | February 2020
Auto, Motor und Sport141842326234123731007103382
Daily Herald50531035623673691659686698
Die Tageszeitung12110897671102715185
El Observador36333280836882901583621625
Entertainment Weekly10493410391117311
GamesMaster768666101110510822
Handelsblatt2133269174317641799251725352571
Jeune Afrique5927040131163202124
Jüdische Allgemeine42033721120114599810241051
Komsomolskaya Pravda1063392612125135140123
La Montagne19197491289422---
Lenta.ru31773315925217778255
Les Inrockuptibles183153261925312479398
Metal.de2753278164254480327396127
News.de354327914061433165938964989
Objectif Gard12921503102583413---
Pitchfork424250133821564
Sport Express1791872946449479741
Süddeutsche Zeitung307613281285573588383445422
TVyNovelas2765280623418863749124399369
The Washington Post132965119141613
Time53035202823152021
Variety362314312
Table 16. Position of the sources in growth ranking in selected language versions of Wikipedia and period of time using PR model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia). Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table16.
Source | German Wikipedia (de) | French Wikipedia (fr) | Russian Wikipedia (ru)
December 2019 | January 2020 | February 2020 | December 2019 | January 2020 | February 2020 | December 2019 | January 2020 | February 2020
Algemeen Dagblad53294123039300016222362518157
Atlanta (magazine)302030903214193152124426572274
Auto, Motor und Sport307553268844625532330350150
Deutsche Jagd-Zeitung231033130------
Die Tageszeitung3076132763044102301125091162694
DWDL.de307323280220027912212823171482
Entertainment Weekly3310232752315131802132777
Izvestia19528942630953631230619215
Jeune Afrique274121630243931771666702161
Komsomolskaya Pravda168261830883016302114732779
la Repubblica995613113123162261592767
Le Figaro étudiant77818891516308026269123111861
Le Monde171433306731212317825644492473
Lenta.ru1930713224397301616122780
Les Inrockuptibles2739264291831221317239024942316
New York Post298481322142223158103251
Novosti Kosmonavtiki563162445722768072468265726892
PC Gamer30431733213305316431455102774
People305183266310153171273342772
Politico1002579471512673216239231
Polka Magazine---12558765---
Radio Times602032594623161272382766
Russkij medicinskij zhurnal177329582709---14352768
Sankt-Peterburgskie Vedomosti6981838914899231512552728403
Sport Express28898502647297830182345327482776
Süddeutsche Zeitung3064432813012283296725901252673
The Daily Gazette13516982629543117421326801986
The Daily Telegraph4307632671325316335224741365
The Tennessean27341645153973124742232614
Time53099327351931764942773
USA Today18103268163316616312762
Variety133269143181912778
Vedomosti34163327642396128618972427354

Lewoniewski, W.; Węcel, K.; Abramowicz, W. Modeling Popularity and Reliability of Sources in Multilingual Wikipedia. Information 2020, 11, 263. https://doi.org/10.3390/info11050263
