Mining knowledge from Wikipedia: Announcing WEX

Growing at approximately 1,700 articles a day, Wikipedia is a significant repository of human knowledge. With its focus and depth, Wikipedia has emerged as a public good of information, fueling a small industry of computer science research. And though Wikipedia contains a wealth of collective knowledge, its idiosyncratic markup and semi-structured design mean that developers wishing to use this resource each incur significant start-up costs simply handling, parsing and decoding the raw corpus.

Like many others, the Metaweb team developed specialized techniques to handle Wikipedia data as a part of our effort to extract facts and incorporate Wikipedia references into Freebase. Several months ago, Nick Thompson prototyped a general approach to Wikipedia processing using Magnus Manske’s Wiki2XML parser, which would allow a variety of new extraction techniques to operate on a standardized representation. Through the efforts of Colin Evans, Al Marks and several other Metaweb developers, the Wikipedia Extraction (WEX) system became a reliable, repeatable process to normalize the Wikipedia corpus into a simple, manageable data set.

Starting today, Metaweb is releasing a downloadable copy of the WEX data set. The WEX representation allows anyone to programmatically access the contents and structure of Wikipedia using simple text, SQL and XML manipulations. Going forward, Metaweb will periodically release updated WEX data as new Wikipedia dumps are pushed through the WEX processing pipeline.
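As a rough illustration of the kind of XML manipulation this makes possible, here is a minimal Python sketch that pulls link targets out of one article's normalized markup. The element names (`article`, `paragraph`, `sentence`, `link`, `target`, `part`) and the sample snippet are assumptions made for the example, not the documented WEX schema; the README that ships with the data set is the place to check the actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of normalized article markup.
# The element names here are assumptions for illustration only;
# consult the WEX README for the real representation.
SAMPLE_ARTICLE_XML = """
<article>
  <paragraph>
    <sentence>Freebase was developed by <link><target>Metaweb</target></link>
    and draws on <link><target>Wikipedia</target><part>Wikipedia articles</part></link>.</sentence>
  </paragraph>
</article>
"""


def extract_link_targets(article_xml):
    """Return the text of every <target> element, in document order."""
    root = ET.fromstring(article_xml.strip())
    targets = []
    for target in root.iter("target"):
        # Skip empty or whitespace-only targets.
        if target.text and target.text.strip():
            targets.append(target.text.strip())
    return targets


if __name__ == "__main__":
    for name in extract_link_targets(SAMPLE_ARTICLE_XML):
        print(name)
```

The same pattern would extend to templates or sections, provided the corresponding element names are taken from the README rather than from this sketch.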

The WEX data set can be downloaded from http://download.freebase.com/wex/. The README document that accompanies the data set describes the WEX representation in detail. Developers interested in discussing WEX and its use should subscribe to the Freebase developer mailing list.

In addition to information about Wikipedia’s structure and normalized article markup in XML, the WEX data provides references to Freebase Topics. These references allow developers to access Freebase Type (is-a) and Property (has-a) information, as well as references (namespace-keys) to other data sources that have been incorporated into Freebase. This additional information can be used to aid various machine learning and natural language processing extraction techniques.

It is our hope that by making the contents of Wikipedia more accessible, developers will explore myriad new techniques for analyzing the wealth of information captured by the Wikipedia community.

7 Responses to “Mining knowledge from Wikipedia: Announcing WEX”

  1. Sam Vilain Says:

    Are there any plans to re-run the importers using the historical data available? Often the best versions of articles are to be found in the history as some editors remove some information they think is unnecessary, which makes other statements sound weird and get removed, etc.

    Git – a file distribution and archival system sometimes thought of as an archival system – would be an excellent choice for this. You can make a git repository that has the “all.tar” file extracted in it, commit, and then every time a new “all.tar” is produced, make it a new commit. This will make the changes to Wikipedia over time much more accessible.

  2. Sam Vilain Says:

    >>Git – a file distribution and archival system sometimes thought of as an archival system

    Er, sometimes thought of as a *version control* system :)

  3. Sam Vilain Says:

    Further note – the files going into the git repositories should not be compressed before they are added to the repository; git’s compression is actually more capable of compressing than simple stream-based systems in any case.

    Also, if historical versions are produced they can be “grafted” on to the current day versions.

  4. Vishnu Vyas Says:

    Are there plans to do the same thing for Wikipedia in other languages? People in computational linguistics and MT would be very grateful for that.

  5. skud Says:

    Hi Sam, good to see you here!

    I don’t believe we have any plans to go back into Wikipedia’s past, but we will be reconciling future changes so that we maintain the best data we can. Someone who deals more with data import than I do would have to confirm this, but I believe that the import to Freebase won’t delete an assertion that already exists, so if someone removed something from Wikipedia it would still remain in Freebase.

    Also note that we mostly parse structured data from Wikipedia, such as category listings and infoboxes, not articles, and the content of those is (I suspect) less likely to undergo the sort of organic change you describe than the full-text articles are.

  6. skud Says:

    Vishnu, I’m not aware of any such plans at the moment, but someone on the data team might be able to clue us both up a bit better. As far as I know, we don’t currently import Wikipedias in other languages into Freebase, and since WEX is a by-product of that, it means we don’t have multilingual data dumps just lying around.

  7. Martin Hepp Says:

    Hi,
    as a related effort: We’ve studied both the conceptual reliability over time and the type of topics of Wikipedia URIs – a copy of the full paper [1] is available at

    http://www.heppnetz.de/files/hepp-siorpaes-bachlechner-harvesting%20wikipedia%20w5054.pdf

    More data and background information is at

    http://www.heppnetz.de/harvesting-wikipedia/

    Best
    Martin

    [1] Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management, IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007.

About

Freebase is a free database of the world's information. This is the official Freebase blog.