Growing at approximately 1,700 articles a day, Wikipedia is a significant repository of human knowledge. With its focus and depth, Wikipedia has emerged as a public good of information, fueling a small industry of computer science research. And though Wikipedia contains a wealth of collective knowledge, due to its idiosyncratic markup and semi-structured design, developers wishing to utilize this resource each incur significant start-up costs simply handling, parsing and decoding the raw corpus.
Like many others, the Metaweb team developed specialized techniques to handle Wikipedia data as a part of our effort to extract facts and incorporate Wikipedia references into Freebase. Several months ago, Nick Thompson prototyped a general approach to Wikipedia processing using Magnus Manske’s Wiki2XML parser, which would allow various new extraction techniques to operate on a standardized representation. Through the efforts of Colin Evans, Al Marks and several other Metaweb developers, the Wikipedia Extraction (WEX) system became a reliable, repeatable process to normalize the Wikipedia corpus into a simple, manageable data set.
Starting today, Metaweb is releasing a downloadable copy of the WEX data set. The WEX representation allows anyone, using simple text, SQL and XML manipulations, to programmatically access the contents and structure of Wikipedia. Going forward, Metaweb will periodically release updated WEX data as new Wikipedia dumps are pushed through the WEX processing pipeline.
The WEX data set can be downloaded from http://download.freebase.com/wex/. The README document that accompanies the data set describes the WEX representation in detail. Developers interested in discussing WEX and its use should subscribe to the Freebase developer mailing list.
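To give a feel for what "simple text, SQL and XML manipulations" can look like in practice, here is a minimal Python sketch. It is not the actual WEX schema: the three-column tab-separated layout (id, title, article XML), the table name `articles`, and the `<article>`, `<paragraph>` and `<link>` element names are all assumptions made up for illustration, as are the two sample rows. The README is the authority on the real file layout and XML vocabulary.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample: two tab-separated rows of (id, title, article XML).
# The real WEX columns and XML element names are documented in the README.
SAMPLE_TSV = (
    "1\tAda Lovelace\t<article><paragraph>Ada wrote about the "
    "<link>Analytical Engine</link>.</paragraph></article>\n"
    "2\tCharles Babbage\t<article><paragraph>Babbage designed the "
    "<link>Difference Engine</link> and the "
    "<link>Analytical Engine</link>.</paragraph></article>\n"
)


def load_articles(tsv_text):
    """Load tab-separated article rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, xml TEXT)"
    )
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Skip blank lines and convert the id column to an integer explicitly.
    rows = [(int(r[0]), r[1], r[2]) for r in reader if r]
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", rows)
    return conn


def link_targets(xml_text):
    """Return the text of every (assumed) <link> element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("link")]


def titles_linking_to(conn, target):
    """Select articles with SQL, then filter on links found in their XML."""
    rows = conn.execute("SELECT title, xml FROM articles ORDER BY id")
    return [title for title, xml in rows if target in link_targets(xml)]


if __name__ == "__main__":
    conn = load_articles(SAMPLE_TSV)
    print(titles_linking_to(conn, "Difference Engine"))
```

The split is deliberate: the text step only gets rows into a table, SQL handles selection, and XML parsing is applied per article. With the real data, the same shape should carry over once the column and element names are swapped for the ones the README specifies.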
In addition to information about Wikipedia’s structure and normalized article markup in XML, the WEX data provides references to Freebase Topics. These allow developers to access Freebase Type (is-a) and Property (has-a) information, as well as references (namespace-keys) to other data sources that have been incorporated into Freebase. This additional information can be used to aid various machine learning and natural language processing extraction techniques.
It is our hope that by making the contents of Wikipedia more accessible, developers will explore myriad new techniques for analyzing the wealth of information captured by the Wikipedia community.

February 19th, 2008 at 1:30 pm
Are there any plans to re-run the importers using the historical data available? Often the best versions of articles are to be found in the history as some editors remove some information they think is unnecessary, which makes other statements sound weird and get removed, etc.
Git – a file distribution and archival system sometimes thought of as an archival system – would be an excellent choice for this. You can make a git repository that has the “all.tar” file extracted in it, commit, and then every time a new “all.tar” is produced, make it a new commit. This will make the changes to Wikipedia over time much more accessible.
February 19th, 2008 at 1:31 pm
>>Git – a file distribution and archival system sometimes thought of as an archival system
Er, sometimes thought of as a *version control* system :)
February 19th, 2008 at 1:38 pm
Further note – the files going into the git repositories should not be compressed before they are added to the repository; git’s compression is actually more capable of compressing than simple stream-based systems in any case.
Also, if historical versions are produced they can be “grafted” on to the current day versions.
February 19th, 2008 at 2:59 pm
Are there plans to do the same thing for Wikipedia in other languages? People in Computational Linguistics and MT would be very grateful for that.
February 19th, 2008 at 10:51 pm
Hi Sam, good to see you here!
I don’t believe we have any plans to go back into Wikipedia’s past, but we will be reconciling future changes so that we maintain the best data we can. Someone who deals more with data import than I do would have to confirm this, but I believe that the import to Freebase won’t delete an assertion that already exists, so if someone removed something from Wikipedia it would still remain in Freebase.
Also note that we mostly parse structured data from Wikipedia, such as category listings and infoboxes, not articles, and the content of those is (I suspect) less likely to undergo the sort of organic change you describe than the full-text articles are.
February 19th, 2008 at 10:58 pm
Vishnu, I’m not aware of any such plans at the moment, but someone on the data team might be able to clue us both up a bit better. As far as I know, we don’t currently import Wikipedias in other languages into Freebase, and since WEX is a by-product of that, it means we don’t have multilingual data dumps just lying around.
March 31st, 2008 at 3:39 am
Hi,
as a related effort: We’ve studied both the conceptual reliability over time and the type of topics of Wikipedia URIs – a copy of the full paper [1] is available at
http://www.heppnetz.de/files/hepp-siorpaes-bachlechner-harvesting%20wikipedia%20w5054.pdf
More data and background information is at
http://www.heppnetz.de/harvesting-wikipedia/
Best
Martin
[1] Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management, IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007.