Do you recognize this man?
Believe it or not, he’s one of our biggest contributors to Freebase. He goes by many names that you may have seen on our pages — mwcl_wikipedia_en, gns_bot, mw_template_bot, mw_prop_bot, mergebot are just a few of his many aliases. He works tirelessly day and night. And, as you might have guessed, he’s not even human.
He’s Bot: the collection of all of the automated processes in Freebase’s data mining pipeline. Freebase thrives on the content that you, our user community, create. But we also want to make sure that there’s a wealth of raw information already there for you to draw upon when you’re adding your data. That’s where Bot comes in. By automatically gathering, cleaning and formatting data from around the web, Bot provides the Freebase community with a huge amount of data for “free”, letting you focus on the specific information your Base, View or Schema requires.
Bot starts his processing with that amazing repository of online information, Wikipedia. Over the past 7 years, Wikipedia has seen remarkable growth. There are currently 2.5 million articles in Wikipedia, on topics as wide-ranging as people, places, history, politics, music, culture, and science. There are articles on companies, consumer goods, and pop media — and the number of articles is still doubling every 12-18 months. If a topic is of any importance at all, it’s a good bet that someone has written an article about it in Wikipedia.
But Wikipedia is written by humans, for humans. This is great if you need to learn about something, or quickly look up a fact. But there are things that are difficult to do in Wikipedia. It’s hard to ask questions, like “What movies by George Lucas has Harrison Ford starred in?” or “Who are all of the characters in the Final Fantasy franchise?” or “Who are all the Popes of the Roman Catholic Church?” or “What are the tallest buildings in San Francisco?”. It’s also hard to build applications on top of Wikipedia’s unstructured data.
Bot, however, has a way of extracting information from Wikipedia, known as WEX (short for “Wikipedia EXtractor”). Run on Freebase’s Hadoop cluster, WEX is a MapReduce process that translates Wikipedia’s markup language into an easy-to-parse XML format. With this XML format in hand (or actuator, as the case may be), Bot can then identify the articles, redirects, and IDs of the new Wikipedia pages, along with a wealth of associated information like article blurbs, templates, images, and categories. (As a side note, Freebase makes downloads of our WEX-parsed files freely available to the community on a quarterly basis. If you’re computationally savvy, check it out; it’s a treasure trove of raw data. And if you need processing tools, you may want to check out Happy, our open-source Python extensions to Hadoop.)
Bot then goes through and looks for new articles in Wikipedia that weren’t there the last time he checked. After filtering out unsuitable articles (lists and the like), he collates the article title, the redirects that point to it, a 1200-character blurb, and an image or two. The article title becomes the name, and the redirects become aliases. At the end of it all, Bot produces a “stub article”, just waiting for a user like you to come along and add it to a base.
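For the curious, here is a rough Python sketch of what that collation step might look like. It is not Bot’s actual code: the `Stub` class, the `is_suitable` filter and the `build_stub` function are names made up for illustration, and the “list of” check is only a stand-in for whatever filtering Bot really does.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BLURB_LIMIT = 1200  # blurb length mentioned in the post
MAX_IMAGES = 2      # "an image or two"


@dataclass
class Stub:
    """Hypothetical container for a stub article."""
    name: str
    aliases: List[str] = field(default_factory=list)
    blurb: str = ""
    images: List[str] = field(default_factory=list)


def is_suitable(title: str) -> bool:
    # Assumption: the post only says lists "and the like" are filtered out,
    # so this simple prefix check is a placeholder for the real rules.
    return not title.lower().startswith("list of ")


def build_stub(title: str, redirects: List[str], text: str,
               images: List[str]) -> Optional[Stub]:
    """Turn one new article into a stub, or return None if it is filtered out."""
    if not is_suitable(title):
        return None
    return Stub(
        name=title,                                  # article title becomes the name
        aliases=sorted(set(redirects) - {title}),    # redirects become aliases
        blurb=text[:BLURB_LIMIT],                    # truncate to the blurb length
        images=list(images[:MAX_IMAGES]),
    )


stub = build_stub(
    "Boeing 777",
    ["B777", "Triple Seven", "B777"],
    "The Boeing 777 is a wide-body airliner. " * 100,
    ["777_a.jpg", "777_b.jpg", "777_c.jpg"],
)
```

Duplicate redirects are collapsed and the blurb is cut by plain character count; a real pipeline would probably cut at a sentence boundary instead.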
But the real value comes from the semistructured data that is available within Wikipedia. Consider the “InfoBox”, a feature often seen on Wikipedia pages. Wikipedians use these templates to organize key information about their articles while giving a common look and feel.
If you look behind the scenes at Wikipedia, you’ll see that this handy summary is created by a bit of structured wikimarkup text, shown on the right. Bot *loves* data like this. Here, in nicely-structured, computer-readable text, are a variety of assertions that look very much like Freebase properties. Not only does it tell us that “Boeing 777 is an aircraft” (or, in Freebase terms, “/en/boeing_777” has a type of “/aviation/aircraft_model”), but it also tells us many of the properties of that type. We also know that “Boeing 777” has a “manufacturer” of “Boeing Commercial Airplanes” (or, in Freebase-speak, /en/boeing_777 -> /aviation/aircraft_model/manufacturer -> /en/boeing_commercial_airplanes), and that “Boeing 777” had a “maiden flight” on “Jun 12, 1994” (/en/boeing_777 -> /aviation/aircraft_model/maiden_flight -> 1994-06-12).
Since WEX parses all of the InfoBox properties (and other templates) within Wikipedia, all 45 million of them, Bot has them at his disposal to create facts in Freebase. Of course, he needs humans to tell him what the right property is for a given InfoBox, but we’ve fed him thousands of these mappings already, which has let him assert 3.9 million high-quality properties based on InfoBoxes.
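To make the idea of a human-supplied mapping concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the template name, the infobox field names, the `PROPERTY_MAP` table and the function names are invented, and the parser only handles simple one-line `|key = value` fields rather than full wikimarkup. Values are left as raw text; resolving them to Freebase IDs or normalized dates is a separate step not shown here.

```python
import re

# Hypothetical hand-written mapping:
# (template name, infobox field) -> Freebase property.
PROPERTY_MAP = {
    ("Infobox Aircraft", "manufacturer"): "/aviation/aircraft_model/manufacturer",
    ("Infobox Aircraft", "first flight"): "/aviation/aircraft_model/maiden_flight",
}

# Matches simple one-line fields such as "|manufacturer = [[Boeing]]".
FIELD_RE = re.compile(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")
# Matches [[Target]] and [[Target|Label]] wiki links.
LINK_RE = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")


def parse_infobox(wikitext):
    """Return a dict of lowercase field name -> cleaned value."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_RE.match(line)
        if match and match.group(2):
            # Replace wiki links with their visible text.
            fields[match.group(1).lower()] = LINK_RE.sub(r"\1", match.group(2))
    return fields


def assert_properties(topic_id, template, wikitext):
    """Emit (topic, property, value) triples for every mapped field."""
    triples = []
    for key, value in parse_infobox(wikitext).items():
        prop = PROPERTY_MAP.get((template, key))
        if prop is not None:  # unmapped fields are skipped
            triples.append((topic_id, prop, value))
    return triples


SAMPLE = """{{Infobox Aircraft
|name = Boeing 777
|manufacturer = [[Boeing Commercial Airplanes]]
|first flight = June 12, 1994
}}"""

triples = assert_properties("/en/boeing_777", "Infobox Aircraft", SAMPLE)
```

Fields with no entry in the mapping, like `name` in the sample, produce no triple, which is why the humans who write the mappings matter so much.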
Similarly, Wikipedia category pages contain information about the types of the topics listed in the category. For instance, if you see “San Francisco Giants” on a category page called “Baseball teams in California”, you can be reasonably sure that “/en/san_francisco_giants” should be typed as a “/baseball/baseball_team”. WEX has over 7 million category-article associations like this, from which Bot has asserted over 1.2 million types.
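The category trick can be sketched in a few lines of Python as well. Again, this is only an illustration under stated assumptions: the `CATEGORY_TYPE_MAP` table and the `infer_types` function are hypothetical names, and the input is assumed to be simple (category, topic ID) pairs rather than the actual WEX output.

```python
from collections import defaultdict

# Hypothetical mapping from Wikipedia category name to Freebase type.
CATEGORY_TYPE_MAP = {
    "Baseball teams in California": "/baseball/baseball_team",
}


def infer_types(associations):
    """Map each topic ID to the set of types implied by its categories."""
    types = defaultdict(set)
    for category, topic_id in associations:
        type_id = CATEGORY_TYPE_MAP.get(category)
        if type_id is not None:  # categories without a mapping are ignored
            types[topic_id].add(type_id)
    return dict(types)


inferred = infer_types([
    ("Baseball teams in California", "/en/san_francisco_giants"),
    ("Sports in San Francisco", "/en/san_francisco_giants"),
    ("Baseball teams in California", "/en/oakland_athletics"),
])
```

Only categories someone has mapped lead to a type assertion, which fits with the post’s numbers: far fewer types asserted than category-article associations available.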
This is just the start. Here at Freebase, we’re constantly adding new functionality to Bot. For instance, there are Bots to make sure that all valid mailing addresses in the US are geocoded with a valid latitude and longitude, so that they can be visualized on tools like Google Maps. We also have Bots to load data about newly released films, including actor and director information. There are Bots to clean up messy and invalid data, Bots to process scheduled merges and deletes, and Bots to find new images for topics. There are even Bots that mark the passing of the dearly departed. And more is on the way: soon we plan on having Bots that provide you with data about books, films, consumer electronics, and more!
Do you have tasks that you think Bot should be assigned? Data that he should focus his unwavering attention on? We’d love to hear from you — let us know your thoughts!


