Science is Shaped by Wikipedia:
Evidence From a Randomized Control Trial*

Neil C. Thompson
MIT Sloan School of Management &
MIT Computer Science and Artificial Intelligence Lab

Douglas Hanley
University of Pittsburgh
Abstract

"I sometimes think that general and popular treatises are almost as important for the progress of science as original work." – Charles Darwin, 1865

As the largest encyclopedia in the world, it is not surprising that Wikipedia reflects the state of scientific knowledge. However, Wikipedia is also one of the most accessed websites in the world, including by scientists, which suggests that it also has the potential to shape science. This paper shows that it does.

Incorporating ideas into a Wikipedia article leads to those ideas being used more in the scientific literature. This paper documents this in two ways: correlationally across thousands of articles in Wikipedia and causally through a randomized control trial where we add new scientific content to Wikipedia. We find that the causal impact is strong, with Wikipedia influencing roughly one in every three hundred words in related scientific journal articles.

Our findings speak not only to the influence of Wikipedia, but more broadly to the influence of repositories of scientific knowledge. The results suggest that increased provision of information in accessible repositories is a cost-effective way to advance science. We also find that such gains are equity-improving, disproportionately benefitting those without traditional access to scientific information.

JEL Codes: O31, O33, O32
*The authors would like to thank MIT for research funding and for the provision of a ridiculous amount of computing resources, Elsevier for access to full-text journal data, Dario Taraborelli at the Wikimedia Foundation for guidance, and Caroline Fry for excellent research assistance.


1 Introduction

In a letter to fellow biologist T.H. Huxley in 1865, Charles Darwin wrote "I sometimes think that general and popular treatises are almost as important for the progress of science as original work" (Lightman 2007, p. 355). And, tellingly, On the Origin of Species was both a seminal scientific work and a bestseller (Radford 2008). This paper asks whether "general and popular treatises" themselves feed back into science and help shape it. Rephrasing this into the language of economics, we ask whether the provision of known scientific knowledge in an open, accessible repository can shape the scientific discussion of those ideas – and, in particular, whether Wikipedia already does.

This is an important public policy question because it has been known since at least Samuelson (1954) that public goods, of which public repositories of knowledge are a good example, are underprovisioned in a market setting. They are thus good candidates for welfare-improving interventions by governments, organizations, and public-spirited individuals.

Governments already embrace the role of providing public goods for science in a number of contexts by funding scientific repositories. These include repositories of physical objects, like seed banks (NCGRP 2005) and model organism repositories (MMRRC 2017), and there is good evidence that this promotes scientific activity (Furman and Stern, 2011). Governments also fund some informational repositories, for example those related to the human genome project (NIH 2017). Many repositories are also run by organizations or individuals. For example, StackOverflow.com is a widely used question-and-answer repository for knowledge about computer programming.

Conversely, the most extensive repositories of scientific knowledge – academic journals – remain overwhelmingly financed by subscription fees, thereby restricting access. But what if many of the key insights from those journal articles were also available in an easily accessible public repository?
Wikipedia is one of the largest informational public goods providers for science. It is freely available, easily accessible, and is the 5th most visited website in the world (Alexa 2017). A wide variety of scientific topics are covered on Wikipedia, and a substantial fraction of Wikipedia articles are on scientific topics. Depending on the definition and methods used, Wikipedia has 0.5-1.0 million scientific articles, representing one article for every ∼120 scientific journal articles. The scientific sophistication of these articles can be substantial. Based on spot testing in Chemistry, we find that Wikipedia covers more than 90% of the topics discussed at the undergraduate level at top-tier research universities, and about half of those covered at the introductory graduate level. Given this extensive coverage, it is clear that Wikipedia reflects science. But does it also shape science? Do scientists read Wikipedia articles and encounter new ideas? Or perhaps scientists encounter ideas on

is that we can look very broadly across Wikipedia articles. The disadvantage is that our results are only correlational; they cannot establish causality. This is an important weakness since it cannot rule out plausible alternatives. Of particular worry is the risk that the correlation comes from mutual causation – in this case, that a breakthrough scientific article would generate both a Wikipedia article and follow-on articles in the scientific literature. This would induce a correlation between Wikipedia and the follow-on articles, but it would not indicate that it is Wikipedia that is shaping science. It seems obvious that this mechanism is occurring. The interesting question thus becomes whether there is an additional impact that Wikipedia itself is causing.

To establish the causal impact of Wikipedia, we performed an experiment. We commissioned subject matter experts to create new Wikipedia articles on scientific topics not covered in Wikipedia. These newly-created articles were randomized, with half being added to Wikipedia and half being held back as a control group.^2 If Wikipedia shapes the scientific literature, then the text from the treatment group articles should appear more often in the scientific literature than the text from the control articles. We find exactly that: the word-usage patterns from the treatment group show up more in the prose of the scientific literature than do those from the control group. Moreover, we find that these causal effects are large, and that they are equity-improving (benefiting disproportionately those who are less likely to have access to academic journals with fees).

2 Public Goods in Science

The underprovision problem of public goods is a well-researched topic. Since at least Samuelson (1954), it has been known that private incentives to provide public goods fall short of the welfare-maximizing level because private providers fail to capture the spillover benefits to others. Under these conditions, there is underprovision of the public good absent intervention by governments, organizations, or public-spirited individuals.

Underprovision of information goods is particularly worrisome, both because the detrimental effect when it occurs could be worse, and because the likelihood of it happening is greater. The effects can be worse because information goods can be costlessly copied and distributed. This means that underprovision could forego a "long tail" of users that could collectively represent a substantial welfare loss. The underprovision problem might also be more likely with information goods because free-riding on informational goods may be easier than on other public goods, leading to fewer initial contributions.

A common way of resolving public goods problems is to make information excludable, for example by putting information into for-pay journals.^3 Under these circumstances, those benefiting from positive spillovers will not be able to free-ride, potentially leading to better incentives for private provision, though at the cost of excluding some consumers from the market. For example, these restrictions could exclude either

(^2) Note: both sets of articles need to be written, since the analysis is lexical, and thus the wording of the control articles is important.
(^3) In this case, the goods should technically be called "club goods".

customers who don’t value the good very much or those who value it highly but are budget constrained. The latter would be particularly worrisome since it would represent a larger welfare loss and exacerbate inequity. The challenge of informational public goods for the scientific literature is, however, worse than the analysis above might suggest. This is because, absent actually reading a scientific article, it may be hard to assess its value to you – that is, due to Arrow’s Information Paradox (Arrow, 1962):

"there is a fundamental paradox in the determination of demand for information; its value for the purchaser is not known until he has the information, but then he has in effect acquired it without cost"

So, to avoid giving away their content for free, journals need to prevent potential consumers from reading an article before they purchase it. But being unable to read articles, it might be hard for consumers to determine their valuation (e.g. whether the article will help solve a problem). This uncertainty will render consumers unwilling to pay their full marginal value for the articles, but rather only a probability-weighted version, where the probability reflects how likely it is that the article will be valuable. As a result, even consumers with marginal values higher than the cost might choose not to purchase the article, magnifying the welfare loss.

There are several distressing implications that arise from Arrow's Information Paradox and raising the price of scientific information goods above marginal cost. First, information is likely to become siloed, with only the most valuable articles from one area crossing over to another. This is a natural consequence of the discussion above, since information in neighboring fields is likely to be less valuable and the probability of recognizing a good article is also likely to fall. Thus only the highest-quality work is likely to be paid for, and much of the potentially useful sharing of knowledge between fields will be stifled.^4

Even within a field of knowledge, the implications of this matching process between scholars and articles are likely to be troubling. It is known that the citation patterns of scientists follow a power law, and thus there exists a long tail of articles that are seldom cited.^5 For some articles, a lack of citations probably indicates a lack of quality.
But other seldom-cited articles may be of high quality but targeted to a limited readership, perhaps to specialists in their narrow field.^6 In these instances, siloing is again likely to be caused by restricted access and Arrow's Information Paradox.

Together these examples highlight the difficult problem with scientific information. A fully open-source model will have too few private incentives, and will require substantial intervention by governments or public-spirited groups to avoid underprovision. In contrast, a fully-closed model is likely to substantially curtail welfare-improving spillovers, both to those with low marginal value and to those who are unable to tell if the underlying articles are worth paying for.
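The probability-weighted demand described above can be made concrete with a small worked example. The numbers here are purely illustrative and do not come from the paper:

```latex
% Illustrative numbers only (not from the paper).
% A reader's value for a relevant article: V = \$30.
% Probability the reader can tell it is relevant before paying: q = 0.2.
% Subscription price: p = \$10; marginal cost of distribution: roughly zero.
\mathbb{E}[\text{value}] = q \cdot V = 0.2 \times 30 = 6 < p = 10 < V = 30
```

The expected value of purchasing ($6) falls below the price ($10), so no purchase occurs even though the reader's realized value ($30) exceeds both the price and the marginal cost, and the potential surplus is foregone.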

(^4) This also corresponds to the experience of one of the authors (Thompson) when he was in the business world and was unable to gather insights from academic articles, because of the fees associated with accessing them.
(^5) The exact type of power law is debated, as can be seen here: https://arxiv.org/abs/1402.
(^6) Increasing specialization, as observed by Wuchty et al. (2007), suggests that this issue could be becoming more prevalent.

Figure 1: Words and articles added to Wikipedia since its inception (monthly additions, 2001-2015; articles in thousands, words in millions)

has about 65,000 articles totalling 40 million words (Wikipedia), while English Wikipedia has about 5 million articles totalling 1.8 billion words. Wikipedia is very widely read. As of 2014, it served a total of 18 billion page views to 500 million unique visitors each month. According to Alexa, a major web analytics company, Wikipedia is the fifth most visited website on the internet, both globally and when restricting to only the US.

The Wikimedia Foundation is a non-profit that operates Wikipedia, as well as numerous related projects such as Wikidata (for structured data), Wikisource (a repository for original sources), and Wiktionary (an open dictionary). It currently has over 200 employees. The website is run using open-source software, much of it developed in house in the form of the MediaWiki platform. This platform has come to be widely used by other wikis, including those not associated with the Wikimedia Foundation. In the 2015-2016 fiscal year, the Wikimedia Foundation had $82 million in revenue and $66 million in expenses. To put these numbers into perspective, the American Type Culture Collection (a major biological research center) has a budget of $92 million (GuideStar, 2017), and Addgene (the non-profit plasmid repository) has a budget of $8.5 million (D&B Hoovers, 2017).

A wide variety of scientific topics are covered on Wikipedia, and a substantial fraction of Wikipedia articles are on scientific topics. Determining exactly which articles do or do not constitute science is somewhat subjective. Depending on the definition and methods used, roughly 10-20% of Wikipedia articles are on scientific topics (between 0.5-1.0 million out of a total of about 5 million).^8 Based on spot testing in

(^8) To determine which articles are considered Chemistry, we rely on Wikipedia's user-generated category system. This tends to pull in far too many articles, though, so we take the additional steps of paring the category tree using a PageRank criterion, hand-classifying a subsample of candidate Chemistry articles, and using them to train a text-based Support Vector Classifier.

Chemistry, we observe that Wikipedia covers more than 90% of the topics discussed at the undergraduate level at top-tier research universities, but only about half of those covered at the introductory graduate level. There exists substantial interest in the open-source community for continuing to deepen the scientific knowledge on Wikipedia (Shafee et al., 2017).

Wikipedia is also used by professionals for scientific information. For example, a 2009 study of junior physicians found that in a given week 70% checked Wikipedia for medical information and that those same physicians checked Wikipedia for 26% of their cases (Hughes et al., 2009).

Previous research by Biasi and Moser (2017) on German textbooks in WWII showed that lowering the cost of scientific information (and thus making it more accessible) led to substantial changes in scientific publishing. Since Wikipedia is also making scientific information cheaper and more widely accessible, we would expect that it too would have an influence on the scientific literature. However, evidence of this effect is largely absent from the usual place where one would look for it: citations from the academic literature. Tomaszewski and MacDonald (2016) find that only 0.01% of scientific articles directly cite Wikipedia entries. We hypothesize that this is not because Wikipedia doesn't have an effect, but rather that academic citations are not capturing the effect that Wikipedia has. To test this, we develop a text-based measure that captures this effect directly in the words used by scientists.

4 Data

This paper relies on four major sources of data. The first is a complete edit history of Wikipedia, that is, all changes to all Wikipedia pages since its inception. The second is a full-text version of all articles since 1995 across 2,061 Elsevier journals, which we use to represent the state of the scientific literature. The third is data on citations to academic journals, which we get from Web of Science. These three sources are described in this section. The fourth data source is a set of Wikipedia articles created as part of the randomized control experiment. We discuss these as part of the experimental design in Section 7.

4.1 Wikipedia

The Wikimedia Foundation provides the full history of all edits to each article on Wikipedia. This includes a variety of projects run by the foundation, in particular, the numerous languages in which Wikipedia is published. For the purposes of this study, we focus only on the English corpus, as it is the largest and most widely used. Even restricting to English Wikipedia, there are numerous non-article pages seldom seen by readers. This includes user pages, where registered users can create their own personalized presence; talk pages, one for each article, where editors can discuss and debate article content and editing decisions; pages associated with hosted media files such as images, audio, and video; and much more.^9

(^9) There are also redirect pages that allow for multiple name variants for a single source page. These are safely ignored.
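The filtering of non-article pages described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline; the page records are hypothetical, though the namespace numbering (0 for articles, 1 for talk pages, 2 for user pages) follows MediaWiki's actual convention:

```python
# Hypothetical page records standing in for entries parsed from the edit-history dump;
# namespace numbering (0 = article, 1 = talk, 2 = user) follows MediaWiki's convention.
pages = [
    {"title": "Magnesium sulfate", "namespace": 0, "redirect": False},
    {"title": "Talk:Magnesium sulfate", "namespace": 1, "redirect": False},
    {"title": "User:Example", "namespace": 2, "redirect": False},
    {"title": "Epsom salt", "namespace": 0, "redirect": True},
]

def is_article(page):
    """Keep only main-namespace, non-redirect pages."""
    return page["namespace"] == 0 and not page["redirect"]

articles = [p["title"] for p in pages if is_article(p)]
```

Redirect pages are dropped here as well, consistent with footnote 9.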

of articles and words being added each month. Generally speaking, new Wikipedia articles start out quite small and grow slowly over time. Roughly 70% of articles are less than 20 words long upon creation, reflecting the fact that many articles begin as a "stub" – a short article, perhaps just a title and a single descriptive sentence, that is intended to be built upon in the future. Figure 3 shows an example of an early edit of the Magnesium Sulfate stub, where new additions are underlined and deletions are struck through.

"Magnesium sulfate, MgSO4 (commonly called "Epsom salt" in hydrated form) is a chemical compound with formula MgSO4. Epsom salt was originally prepared by boiling down mineral waters at Epsom, England and afterwards prepared from sea water. In more recent times, these salts are obtained from certain minerals such as siliceous hydrate of magnesia."

Figure 3: Example of the early editing on the Magnesium Sulfate article

Figure 4 plots the size distribution of newly created articles that are longer than 20 words. Here we can see that the bulk of articles begin at less than 200 words. There is some mass in the tails of the distribution, though this may be due to the renaming or reallocation of large existing articles.

Figure 4: Size distribution of new articles longer than 20 words (word count, 200-1,000, vs. number of articles, 0-30,000)

In Figure 2 there was some evidence of tapering off in the number of chemistry and econometrics articles being created. This is likely because many of the most important topics in these fields have already been added. Figure 5 shows corroborative evidence of this by plotting how articles grow on average. Interestingly, all three cohorts average approximately 250 words when first written. Article lengths expand significantly after this, but particularly so for the earliest cohort – again suggesting that early Wikipedia articles were on broader, more important topics.

Figure 5: Average size of articles conditional on age (daily; word count, 0-6,000, vs. age in years, 0-14)

Finally, in Figure 6 we present the current distribution of article size conditional on being larger than 30 words. Here we see the characteristic long tail extending nearly linearly in log-log space. There are also a large number of articles with very few words. We exclude such “stub” articles from our analysis by imposing a minimum of 500 characters in each article for inclusion in our sample.

4.1.2 Word Coverage

For our word-level analysis we focus on the top 90% most common unique words in the scientific journals. The entire vocabulary that we could potentially use contains ∼1.2M words, of which we focus on the most-used ∼1.1M in science. This serves two purposes: (1) it eliminates noise from words with single-digit frequencies (and thus where there are large proportional swings in usage), and (2) it avoids issues arising from errors in parsing, non-content strings (such as URLs), and misspellings in the source text. It should be noted that this set of words will include very common ones such as "the" and "a." In subsequent analysis, we use inverse document frequency weighting, which ensures that our results are not being driven by these words. Even after such a cull, the words represented here account for 99% of word usage in science and 72% of word usage in the Chemistry pages of Wikipedia. We can also consider the overlap in the two vocabularies. We see that they are similar, but also have substantial differences. About 61% of the words in the scientific literature appear in Wikipedia, while amongst the set of words appearing in Wikipedia, about 63% appear in science. The following provides some

level category (say, Chemistry), find all of its descendant subcategories, then find all pages belonging to such categories. Unfortunately, this pulls in a large number of false positives. For example, following the hierarchy from Chemistry yields: Chemistry > Laboratory Techniques > Distillation > Distilleries > Yerevan Brandy Company (an Armenian cognac producer). To correct for false positives, we hand-classify a set of 500 articles and use these to train a support vector machine (SVM) classifier. The SVM maps vectors of word frequencies into a binary classification (in the field or not). The SVM is a standard technique in machine learning for tackling high-dimensional classification problems such as this one. In the case of chemistry, this process narrows the set of 158,000 potential articles to 27,000 likely chemistry articles in Wikipedia.
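A minimal sketch of this classification step, assuming scikit-learn is available. The snippets and labels below are invented stand-ins for the 500 hand-classified articles, not the paper's data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented stand-ins for the hand-classified training articles.
texts = [
    "acid base titration reaction molarity",       # chemistry
    "covalent bond electron orbital valence",      # chemistry
    "ionic compound crystal lattice solubility",   # chemistry
    "brandy distillery armenia cognac exports",    # not chemistry
    "football league season championship goals",   # not chemistry
    "symphony orchestra violin concerto premiere", # not chemistry
]
labels = [1, 1, 1, 0, 0, 0]

# Map each text to a vector of word frequencies, then fit a linear SVM.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = LinearSVC().fit(X, labels)

def is_chemistry(text):
    """Predict whether a candidate article belongs to the field."""
    return bool(clf.predict(vectorizer.transform([text]))[0])
```

In the paper's setting, the candidate articles come from the pared category tree and the classifier prunes the false positives such as the Yerevan Brandy Company example.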

4.2 Scientific Literature

The data on the scientific literature is provided by the Elsevier publishing company and includes the full text of articles published in their journals. This is useful for us, since it allows us to look for the words used in the scientific literature and whether they reflect those used in Wikipedia. In addition, we make use of the article metadata provided, such as author and publication date. The entire dataset includes 2,061 journals over many years. Since we are interested in the interaction of the scientific literature with Wikipedia, we use only data from 2000 onward. For each article, we observe the journal that it is published in, the year of publication, the journal volume and issue numbers, the title and author of the article, and the full text. We don't make use of any image data representing figures, charts, or equations, since our analysis is word-based. Finally, since journal publication time is often poorly documented (saying, for example, "Spring 2009"), we hand-collect this information at the journal issue level for the journals we use. Focusing specifically on the chemistry literature, which we examine in particular detail, we look at 50 of the highest-impact journals, constituting 745,000 articles. Of these, we focus on the 326,000 that are from after 2000.

4.3 Web of Science Citation Data

The data on academic citations is provided by Web of Science. It provides directional links, indicating which papers cite which other ones. This information is also aggregated to provide total yearly citation counts for each paper.

to high PageRank nodes. This eliminates only 1% of nodes and renders the graph acyclic (a tree).

5 Observational Analysis Methodology

The purpose of this first analysis is to establish the broad correlations between word usage in Wikipedia and word usage in the scientific literature. The intent in this section is not to establish causation, but rather to observe in general how contemporaneous these changes are across many different areas in Wikipedia. In Sections 6.3 and 6, we assess how much of these effects are attributable to a causal effect of Wikipedia on Science.

5.1 Word Co-occurrence

5.1.1 Documents

In addition to analyzing the usage of individual words, we also take advantage of their arrangement into documents in both corpora. Given a certain set of possible words (a vocabulary) of size K, each document can be represented by a K-dimensional vector in which each entry denotes the number of appearances of a particular word. This is referred to as a bag-of-words model, because information on word positions within the text is discarded. These vectors are generally extremely sparse, since only a small number of words are represented in any document. We can now define the cosine similarity metric between two documents with vectors v1 and v2 as

d(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}

where \|v\| = \sqrt{v \cdot v}. This satisfies the natural properties that: (1) d(v_1, v_2) ∈ [0, 1]; (2) d(v, v) = 1; and (3) d(v_1, v_2) = 0 when v_1 and v_2 have non-overlapping support. To account for the fact that some words carry more meaning than others, we utilize term frequency-inverse document frequency (tf-idf) weighting to inflate the relative weight given to rarer (and presumably more important) words. In particular, this scheme weights tokens by the inverse of the fraction of all documents that the token appears in. This is a standard metric used in text analysis problems.

Most articles in the scientific literature are not that similar to any given Wikipedia article. Figure 7 shows this empirically, plotting the average similarity between all pairs of Wikipedia and scientific articles in our Chemistry sample. To estimate the effect of Wikipedia on Science from our observational data requires two elements: an observed correlation between Wikipedia and the change in language use in the scientific literature (the raw effect), and a counterfactual about how word usage would have changed absent the Wikipedia article. We calculate the latter first, by considering language drift.
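The bag-of-words vectors, idf weighting, and cosine similarity can be sketched as follows. This is a simplified illustration, not the paper's code; here the idf weight is the log of the inverse of the document-frequency fraction, and the toy documents are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse bag-of-words vectors with inverse-document-frequency weighting."""
    n = len(docs)
    counts = [Counter(doc.split()) for doc in docs]
    df = Counter(word for c in counts for word in c)     # document frequency
    idf = {word: math.log(n / df[word]) for word in df}  # rarer words weigh more
    return [{word: c[word] * idf[word] for word in c} for c in counts]

def cosine_similarity(u, v):
    """d(u, v) = (u . v) / (||u|| ||v||), lying in [0, 1] for nonnegative weights."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

# Toy stand-ins for a Wikipedia article and two journal abstracts.
docs = [
    "magnesium sulfate is a chemical compound",
    "the compound magnesium sulfate forms hydrates",
    "graphene exhibits remarkable electrical conductivity",
]
vectors = tfidf_vectors(docs)
```

A document is maximally similar to itself (property 2), and documents sharing no words score zero (property 3).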

5.1.2 Language Drift

Over time, the usage of words in scientific writing naturally varies. Sometimes this is due to non-fundamental changes in terminology, and other times this is due to the advent of genuinely new concepts or discoveries,

Figure 8: Simulated drift in science-Wikipedia document similarity (change in number of articles, %, vs. document similarity)

The distribution provided by this drift analysis provides a baseline against which we can measure the raw observational correlation associated with adding a Wikipedia article.

5.1.3 Specifications

We calculate the (raw) effect of adding a Wikipedia article using a regression approach. Let us denote the cosine similarity between Wikipedia article i and scientific article j at time t as d_ijt. This notation will include all article pairs, even those where the scientific article was published before the Wikipedia article. Thus, let us also denote by w_ijt the binary indicator of whether scientific article j was written after Wikipedia article i. With our notation defined, we can state the precise specification we use:

d_{ijt} \sim \alpha + \tau \, w_{ijt}

This is essentially a difference in means. With estimates of both the raw treatment effect, τ, and the counterfactual drift tendency, δ, the net effect of adding a Wikipedia article can be calculated as: ω = τ − δ.
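Because the specification reduces to a difference in means, it can be computed directly. The sketch below uses invented similarity scores and an invented drift value, purely to illustrate the arithmetic:

```python
import statistics

def raw_effect(similarities, after_flags):
    """Difference-in-means estimate of the specification d_ijt = alpha + tau * w_ijt."""
    post = [d for d, w in zip(similarities, after_flags) if w]
    pre = [d for d, w in zip(similarities, after_flags) if not w]
    alpha = statistics.mean(pre)         # intercept: mean similarity before creation
    tau = statistics.mean(post) - alpha  # raw treatment effect
    return alpha, tau

# Hypothetical similarity scores for article pairs before/after Wikipedia-article creation.
sims = [0.130, 0.132, 0.129, 0.133, 0.131, 0.134]
after = [0, 0, 0, 1, 1, 1]

alpha, tau = raw_effect(sims, after)
delta = 0.0002        # hypothetical counterfactual drift tendency
omega = tau - delta   # net effect of adding the Wikipedia article
```

Subtracting the drift term δ removes the similarity increase that would have occurred anyway, leaving the net effect ω attributable to the article itself.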

5.2 Measurement Timeline

In order to examine the relationship between Wikipedia and science, we look at scientific articles shortly before and shortly after the appearance of a new Wikipedia article. Our hypothesis is that if Wikipedia has an impact on the progression of the literature, science published after the creation of the Wikipedia article will bear a closer similarity to the article than the science published before it did.

We consider an article to have been “created” three months after its first appearance. We impose this delay to reflect a common article creation sequence in Wikipedia wherein someone (such as an editor) indicates that a new page should be written and creates a placeholder for it (a “stub”), after which subsequent edits are made to fill in the page. Such stubs are a prevalent phenomenon on Wikipedia (Shafee et al., 2016), and absent this choice we would have a large number of articles “created” with almost no content.^13 We look for effects of the Wikipedia article on science at two time windows around article creation, one six-month window preceding it and one six-month window starting three months after it. The latter delay also accounts for publication lags in science. The following diagram explains this:

Pre-window (6 months) → Wikipedia publication → Delay (3 months) → Post-window (6 months)

Figure 9: Measurement timeline

For each Wikipedia article, there is a certain set of scientific articles associated with the pre and post windows, respectively. This induces a distribution of similarities (pre and post) for each Wikipedia article. In our analysis, we look at the average difference between these pre and post distributions. If the post distribution shifts closer to the Wikipedia article, it suggests an increased correlation between Wikipedia and the scientific articles.
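The timeline above can be made concrete with a small helper that maps an article's first-appearance date to its pre- and post-windows. This is a hypothetical sketch (the function name is ours, and month lengths are approximated in days), not code from the paper:

```python
from datetime import date, timedelta

def measurement_windows(first_appearance: date):
    """Return ((pre_start, pre_end), (post_start, post_end)) around an
    article's creation date, following the timeline described above."""
    # An article counts as "created" three months after it first
    # appears, to let stubs be filled in with content.
    creation = first_appearance + timedelta(days=91)

    # Six-month window ending at the creation date
    pre = (creation - timedelta(days=182), creation)

    # Three-month delay for publication lags, then a six-month window
    post_start = creation + timedelta(days=91)
    post = (post_start, post_start + timedelta(days=182))
    return pre, post

pre, post = measurement_windows(date(2010, 1, 1))
```

Scientific articles dated inside `pre` contribute to the pre-window similarity distribution, and those inside `post` to the post-window distribution.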

6 Observational Analysis Results

6.1 Overall correlations

Figure 10 plots the log frequencies of tokens with above-median frequency in both Wikipedia and science. The red line shows an OLS regression fit, indicating a strong positive correlation: there is a strong relationship between the relative frequencies of words in the two corpora. Nonetheless, there are considerable differences in usage frequencies at all levels.
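The comparison in Figure 10 amounts to an OLS fit on the log-log scale of corpus frequencies. A minimal sketch, using invented token counts rather than the paper's corpora:

```python
import numpy as np

# Invented counts for five tokens appearing in both corpora
wiki_counts = np.array([55, 120, 210, 340, 980], dtype=float)
sci_counts = np.array([70, 100, 150, 400, 1200], dtype=float)

x = np.log(wiki_counts)   # log frequency in Wikipedia
y = np.log(sci_counts)    # log frequency in science

slope, intercept = np.polyfit(x, y, 1)   # OLS fit on the log-log scale
r = np.corrcoef(x, y)[0, 1]              # strength of the relationship
```

A positive slope and a correlation coefficient near one correspond to the strong relationship visible in Figure 10; residual scatter around the fitted line captures the differences in usage between the corpora.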

6.2 Event studies

Below are some examples of token frequency time series in the present vocabulary. Each is shown starting from 2001, when Wikipedia started, until nearly the present day. From these we can see that there is certainly a relationship between token usage in the two corpora, although it varies considerably across tokens.

(^13) An example of a newly-created article with almost no content can be seen at https://en.wikipedia.org/wiki/Paracamelus, as of April 2017.

Figure 11: Sample token frequency time series in Chemistry (panels for “graphene”, “photovoltaic”, “ozone”, and “reaction”; log frequency in science and Wikipedia, 2001–2013)

Table 2: Observational Effect of new Wikipedia Article (not accounting for language drift)

                     Similarity
    Intercept        0.1309***
    After            0.0011***
    N                681,875
    R^2              0.
    Adjusted R^2     0.
    F Statistic      60.

Note: *p < 0.1; **p < 0.05; ***p < 0.01

Figure 12 plots the density over the similarity metric between the Wikipedia article and its associated in-window scientific articles, pooling over all Wikipedia articles, and takes the difference in density between the pre- and post-windows. Because both the pre and post distributions are densities, this differential must integrate to zero, reflecting where the relative density has risen and fallen.
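This zero-integral property can be checked numerically. The sketch below uses simulated similarity scores (the distributions are illustrative stand-ins, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy cosine-similarity samples: the post-window distribution is
# shifted slightly to the right of the pre-window one.
pre = rng.beta(2, 12, size=5000)
post = rng.beta(2, 11, size=5000)

bins = np.linspace(0.0, 1.0, 41)
pre_density, _ = np.histogram(pre, bins=bins, density=True)
post_density, _ = np.histogram(post, bins=bins, density=True)

diff = post_density - pre_density      # change in density at each level
width = bins[1] - bins[0]
integral = (diff * width).sum()        # ~0: both densities integrate to 1
```

The bin-by-bin values of `diff` are what a plot like Figure 12 displays: positive where similarity mass has risen after article creation and negative where it has fallen, summing to zero overall.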

Figure 12: Evolution of the similarity between Wikipedia and science after a new Wikipedia article (change in number of articles, %, by document similarity), not accounting for language drift

We can see in Figure 12 the shift from the lower similarity levels between 0.1 and 0.2 to higher levels, namely those above 0.2.^14 What may not be as clear from this plot is the proportional size of these changes. Similarity levels above 0.3 are relatively rare, so the increase in the right tail of this graph is a meaningful one. Figure 13 reflects this, showing the proportional change at each level of similarity.

Recall that the estimate from Table 2 represents only the raw effect, of 0.11%. This effect needs to be contrasted with the counterfactual drift value of −0.07% calculated in Section 5.1.2. Thus we get a correlational effect of ω = 0.11% − (−0.07%) = 0.18%. Our experimental setup, presented next, will not require this drift adjustment because the control group can be used directly as the counterfactual.

Although the correlations presented in this section are suggestive, they are not causal. It is possible that they represent an effect that Wikipedia is having on the scientific literature, but such an effect is indistinguishable from another causal pathway: new scientific topics generating both a Wikipedia article and more follow-on work. Separating these two effects requires turning to our experimental results.

(^14) These effects are similar if other window sizes are used.