<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Freebase Blog &#187; Data loads</title>
	<atom:link href="http://blog.freebase.com/category/data/data-loads/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.freebase.com</link>
	<description>A blog for data geeks, application developers and interested civilians</description>
	<lastBuildDate>Wed, 25 Nov 2009 01:29:13 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>10 Million Topics!</title>
		<link>http://blog.freebase.com/2009/11/24/10-million-topics/</link>
		<comments>http://blog.freebase.com/2009/11/24/10-million-topics/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 00:43:51 +0000</pubDate>
		<dc:creator>zenkat</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Data loads]]></category>
		<category><![CDATA[freebase]]></category>
		<category><![CDATA[milestone]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=1279</guid>
		<description><![CDATA[Pop open the Champagne!
Freebase has passed a notable milestone.  On Sunday, at about 11:00am PST, we zoomed by our 10 millionth topic &#8212; and by the time you read this post, we should surpass the 11 million topic mark.  A year ago, Freebase stood at just over 4 million topics.  That&#8217;s an annual growth rate [...]]]></description>
			<content:encoded><![CDATA[<p>Pop open the <a href="http://www.freebase.com/view/en/champagne">Champagne</a>!</p>
<p>Freebase has passed a notable milestone.  On Sunday, at about 11:00am PST, we zoomed by our 10 millionth topic &#8212; and by the time you read this post, we should surpass the 11 million topic mark.  A year ago, Freebase stood at just over 4 million topics.  That&#8217;s an annual growth rate of over 100%.</p>
<p><img class="alignright" src="http://rystarr.files.wordpress.com/2009/06/fireworks1.jpg?w=323&amp;h=283" alt="Celebrate!" width="323" height="258" /></p>
<p>A great deal went into achieving this milestone &#8212; contributions from prolific community members like <a href="http://www.freebase.com/view/user/tfmorris">tfmorris</a>, <a href="http://www.freebase.com/view/user/pak21">pak21</a> and <a href="http://www.freebase.com/view/user/sprocketonline">sprocketonline</a> (see our <a href="http://experthub.freebaseapps.com/lb-contributions">contribution leaderboard</a> for more); new Data Team tools like the <a href="http://data.labs.freebase.com/recon/">recon service</a>, RABJ, and the <a href="http://data.labs.freebase.com/recon/recon.html">spreadsheet loader</a>; and continued growth in traditional data sources like <a href="http://download.freebase.com/wex/">Wikipedia</a>.  But the largest segment of growth came from our continuing efforts to build a comprehensive repository of high-quality information about media in all its forms &#8212; especially <a href="http://www.freebase.com/view/music">music</a>, <a href="http://www.freebase.com/view/film">movies</a>, <a href="http://www.freebase.com/view/tv">TV</a> and <a href="http://www.freebase.com/view/book">books</a>.</p>
<p>In October, we rounded out our TV domain by synchronizing with the excellent user-curated TV fan site <a href="http://www.tvrage.com">TVRage.com</a>.  Combined with earlier data loads from <a href="http://www.thetvdb.com">thetvdb.com</a>, we now have comprehensive coverage of nearly every TV <a href="http://www.freebase.com/view/tv/tv_program">show</a> and <a href="http://www.freebase.com/view/tv/tv_series_episode">episode</a> created in the United States.  It includes cast and credits, as well as links to key TV websites like <a href="http://www.tvguide.com/">tvguide.com</a> and <a href="http://www.hulu.com/">Hulu</a> &#8212; nearly a million topics in all!</p>
<p>But the load that took us over the 10 million mark was the final load of editions from <a href="http://openlibrary.org/">Open Library</a>.  Comprising 650,000 authors, almost 2 million books and 2.1 million book editions, this load pushed new boundaries in our data acquisition, curation, reconciliation and QA processes.</p>
<p>In the months ahead, we&#8217;ll be continuing to both curate and extend our media data loads with more high-quality data sets.  We plan on continuing to reconcile authors and books already in Freebase, as well as loading more books from curated bibliographic catalogs.  We&#8217;ll also be fleshing out our data about movies with data from <a href="http://developer.netflix.com/">Netflix</a>, as well as restarting our regular synchronizations with <a href="http://www.musicbrainz.org">MusicBrainz</a> and their <a href="http://wiki.musicbrainz.org/Next_Generation_Schema">Next Generation Schema</a>.</p>
<p>Congratulations to everyone who helped get us to this point.  It&#8217;s been an exciting year &#8212; with more great data to come!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2009/11/24/10-million-topics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Freebase now has 8.4 million topics</title>
		<link>http://blog.freebase.com/2009/08/12/freebase-now-has-8-4-million-topics/</link>
		<comments>http://blog.freebase.com/2009/08/12/freebase-now-has-8-4-million-topics/#comments</comments>
		<pubDate>Wed, 12 Aug 2009 20:52:13 +0000</pubDate>
		<dc:creator>skud</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[Data loads]]></category>
		<category><![CDATA[Freebase Commons]]></category>
		<category><![CDATA[books]]></category>
		<category><![CDATA[olp]]></category>
		<category><![CDATA[tippify]]></category>
		<category><![CDATA[tv]]></category>
		<category><![CDATA[tv4me]]></category>
		<category><![CDATA[tvrage]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=1138</guid>
		<description><![CDATA[Remember not so long ago when we were very excited about crossing the 5 million mark?  Well, in the last couple of weeks, we suddenly raced past 6, 7, and 8 million topics in Freebase, to reach a current total of 8,450,348 topics.
This is largely thanks to two big loads from the data team: [...]]]></description>
			<content:encoded><![CDATA[<p>Remember not so long ago when we were very excited about crossing the <a href="http://blog.freebase.com/2009/04/08/five-million-topics/">5 million</a> mark?  Well, in the last couple of weeks, we suddenly raced past 6, 7, and 8 million topics in Freebase, to reach a current total of <a href="http://www.freebase.com/explore">8,450,348 topics</a>.</p>
<p>This is largely thanks to two big loads from the data team: first up, a massive import of millions of books and related information from the <a href="http://openlibrary.org/">Open Library Project</a>, and secondly a big load of around 255k TV episodes from <a href="http://tvrage.com/">TVRage</a>.</p>
<p>Here&#8217;s a chart showing topic growth in Freebase since the beginning of April:</p>
<div id="attachment_1143" class="wp-caption aligncenter" style="width: 603px"><img src="http://blog.freebase.com/wp-content/uploads/2009/08/fb-growth2.png" alt="Topic growth in Freebase, April to August 2009" title="Topic growth in Freebase" width="593" height="360" class="size-full wp-image-1143" /><p class="wp-caption-text">Topic growth in Freebase, April to August 2009</p></div>
<p>To take a look at the relevant data, visit our <a href="http://www.freebase.com/view/book">Publishing Commons</a> or <a href="http://www.freebase.com/view/tv">TV Commons</a>.  For some apps taking advantage of all this data, look at <a href="http://tippify.com">Tippify</a> for book recommendations, or Alex&#8217;s <a href="http://tv4me.freebaseapps.com/">TV4Me</a> app (currently in development), a mashup of Freebase&#8217;s TV data with US broadcast schedules.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2009/08/12/freebase-now-has-8-4-million-topics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Our latest mass data load: science fiction books</title>
		<link>http://blog.freebase.com/2009/04/14/science-fiction-books/</link>
		<comments>http://blog.freebase.com/2009/04/14/science-fiction-books/#comments</comments>
		<pubDate>Wed, 15 Apr 2009 00:36:55 +0000</pubDate>
		<dc:creator>zenkat</dc:creator>
				<category><![CDATA[Data loads]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=500</guid>
		<description><![CDATA[Just in time for the Hugo Awards &#8230; the Internet Speculative Fiction Database has landed at Freebase!
We&#8217;re big fans of the ISFDB.  Fan-created and maintained, it is widely considered one of the most authoritative sources about Science Fiction, Fantasy, and Horror literature available on the Internet.  It obsessively documents the literary careers of thousands of [...]]]></description>
			<content:encoded><![CDATA[<p>Just in time for the <a title="Hugo Awards 2009 Nominations" href="http://www.thehugoawards.org/?p=260">Hugo Awards</a> &#8230; the <a title="ISFDB" href="http://208.100.59.10/cgi-bin/index.cgi">Internet Speculative Fiction Database</a> has landed at Freebase!</p>
<p>We&#8217;re big fans of the <a title="ISFDB Entry @ Freebase" href="http://www.freebase.com/view/en/internet_speculative_fiction_database">ISFDB</a>.  Fan-created and maintained, it is widely considered one of the most authoritative sources about Science Fiction, Fantasy, and Horror literature available on the Internet.  It obsessively documents the literary careers of thousands of authors, including the series they&#8217;ve created, the titles they&#8217;ve written, and the awards they&#8217;ve won.  And, in great Open Data fashion, it&#8217;s available under a <a title="CC 2.0 License" href="http://creativecommons.org/licenses/by/2.0/">Creative Commons</a> license.</p>
<p>And now the contents of the ISFDB are available in Freebase.  On our first load, we&#8217;ve focused exclusively on the book-length works within ISFDB: novels and short-story collections, nicely complementing the <a title="Freebase Blog: 100K Books" href="http://blog.freebase.com/2009/01/05/100000-books/">100K books loaded by Toby</a> back in January.</p>
<p>As always, we&#8217;ve taken great care to reconcile these new entries from ISFDB with items already in Freebase.  When we do this, we split our load into three groups: items that are clearly not in Freebase (new); items that clearly match to something already in Freebase (match); and items where we can&#8217;t make an automated decision given the available data (unclear).  We&#8217;ve already loaded items from the first two groups.  Items in the &#8220;unclear&#8221; group still need to be manually reviewed before they can be loaded.</p>
<ul>
<li>Authors: 9,212 new and 3,226 matched (67% coverage)</li>
<li>Titles: 32,945 new and 12,900 matched (70% coverage)</li>
<li>Editions: 88,697 new and 12,845 matched (87% coverage)</li>
<li>Series: 5,035 new and 139 matched (96% coverage)</li>
</ul>
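<p>A minimal Python sketch of that three-way split. The score thresholds and function names are made up for illustration and are not the data team's actual reconciliation code; "coverage" is assumed here to mean the share of records that land in the first two groups.</p>

```python
from collections import Counter

# Hypothetical thresholds: in practice these would be tuned on a hand-checked sample.
MATCH_THRESHOLD = 0.9
NEW_THRESHOLD = 0.2


def bucket(best_score):
    """Classify an incoming record by its best match score against existing topics.

    best_score is assumed to lie in [0, 1], or to be None when no candidate was found.
    """
    if best_score is None or best_score <= NEW_THRESHOLD:
        return "new"
    if best_score >= MATCH_THRESHOLD:
        return "match"
    return "unclear"


def summarize(scores):
    """Count records per bucket and the share that can be loaded automatically."""
    counts = Counter(bucket(s) for s in scores)
    total = sum(counts.values())
    loaded = counts["new"] + counts["match"]
    coverage = loaded / total if total else 0.0
    return counts, coverage
```

<p>Anything that falls between the two thresholds stays in the "unclear" group until someone reviews it by hand.</p>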
<p>So you can enjoy <a title="Anathem" href="http://www.freebase.com/view/en/anathem">Anathem</a>, <a title="The Graveyard Book" href="http://www.freebase.com/view/en/the_graveyard_book">The Graveyard Book</a>, <a title="Little Brother" href="http://www.freebase.com/view/guid/9202a8c04000641f8000000007b809b9">Little Brother</a>, <a title="Saturn's Children" href="http://www.freebase.com/view/guid/9202a8c04000641f8000000008ece2d3">Saturn&#8217;s Children</a> and <a title="Zoe's Tale" href="http://www.freebase.com/view/guid/9202a8c04000641f80000000080fd18b">Zoe&#8217;s Tale</a>, along with comprehensive views of the works of <a title="Cory Doctorow" href="http://www.freebase.com/view/user/zenkat/default_domain/views/cory_doctrow_s_books">Cory Doctorow</a>, lists of <a title="Coraline Editions" href="http://www.freebase.com/view/en/coraline/-/book/book/editions">all editions of Coraline</a>, or the publication history of Stephenson&#8217;s <a title="Baroque Cycle" href="http://www.freebase.com/view/user/zenkat/default_domain/views/baroque_cycle">Baroque Cycle</a>.</p>
<p>Coming soon &#8212; cover art, awards, and better reconciliation coverage.  Enjoy!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2009/04/14/science-fiction-books/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Partial evidence and human intelligence: a dispatch from the data mines</title>
		<link>http://blog.freebase.com/2009/03/05/partial-evidence-and-human-intelligence-a-dispatch-from-the-data-mines/</link>
		<comments>http://blog.freebase.com/2009/03/05/partial-evidence-and-human-intelligence-a-dispatch-from-the-data-mines/#comments</comments>
		<pubDate>Thu, 05 Mar 2009 20:43:44 +0000</pubDate>
		<dc:creator>skud</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Data loads]]></category>
		<category><![CDATA[freebase]]></category>
		<category><![CDATA[gwap]]></category>
		<category><![CDATA[typewriter]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=452</guid>
		<description><![CDATA[I&#8217;ve been dying to write about this for ages, and I hope I can do it justice.  It&#8217;s a hairy subject, but one that&#8217;s very close to all our hearts here at Freebase.
People sometimes ask us, Why haven&#8217;t you scraped infobox X from Wikipedia? or Why don&#8217;t you just load all the data from [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been dying to write about this for ages, and I hope I can do it justice.  It&#8217;s a hairy subject, but one that&#8217;s very close to all our hearts here at Freebase.</p>
<p>People sometimes ask us, <b>Why haven&#8217;t you scraped infobox X from Wikipedia?</b> or <b>Why don&#8217;t you just load all the data from site Y?</b>  The short answer is, in many cases, we <i>have</i> fetched that data.  And yet we haven&#8217;t loaded it into Freebase.</p>
<p>Why not?</p>
<p>Well, information we gather <i>en masse</i> is seldom 100% correct.  For example, we might extract a list of fashion designers from Wikipedia&#8217;s category of the same name, intending to assert the fact <a href="http://www.freebase.com/view/people/person?q=eNqNkd1ugzAMhd_F1xEwbaq0vEqpUAaGRsvfEjMJVbz7kjBoWVtpVyTO8XeOzfECddlara2pS7JOtnUpPMlWIfALDKPsgJtRKQZKaknAXxhYR9IaoYCTH3Fm9wipxZAB_2rvBGFj--ZDejqv-rvOQ8XACI1r2YgFI2kCfnxgddjr97YMaHLxLWZXts2oNMVoyE8wnxg4JdrbVLcGz7Fxmq_3t-qVO297DCE-beGiFUaTXoRzLDcdBjkY9LDnbbEcWqcwfq-klCtUfJFcqZsSfYiqtADzmQUkNQYS2uXb0w2kQ12m35AaIuBbqPE3UfL8I8z4B_V8zRmtj_tfoxZJX2xRigX-aNIl_3z6AeZX5Xw%3D&#038;view_id=table">Profession: fashion designer</a> on the relevant <a href="http://freebase.com/view/people/person">Person</a> topics.  But what if some fashion <i>houses</i> or <i>labels</i> had snuck into the list of fashion designers?  We don&#8217;t want to accidentally type those companies or brands as people.</p>
<p>In general, when our data team collects information like this, rather than making assertions immediately in Freebase&#8217;s graph, they consider it to be <b>partial evidence</b> and it goes into a <b>partial evidence store</b>.  To get out of that store and into Freebase, a piece of evidence has to be</p>
<ul>
<li>confirmed by one or more other sources,</li>
<li>reviewed by a human being, or</li>
<li>drawn from an original source that has been proven sufficiently accurate by statistical methods.</li>
</ul>
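<p>Those three conditions can be sketched as a simple promotion check. This is a hedged illustration only: the <code>Evidence</code> class, its field names and the two cut-off values are hypothetical, not the real structure of the partial evidence store.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One unconfirmed assertion, e.g. (topic, "type", "/tv/tv_program")."""
    assertion: tuple
    sources: set = field(default_factory=set)  # every source the claim was seen in
    human_approved: bool = False               # set once a reviewer confirms it
    source_precision: float = 0.0              # measured accuracy of the original source


# Hypothetical cut-offs, chosen only for illustration.
MIN_SOURCES = 2       # the original source plus at least one other
MIN_PRECISION = 0.95


def ready_to_load(ev):
    """Return True if any one of the three promotion conditions holds."""
    confirmed = len(ev.sources) >= MIN_SOURCES
    trusted = ev.source_precision >= MIN_PRECISION
    return confirmed or ev.human_approved or trusted
```

<p>Any single condition is enough, so a claim seen in only one source can still be loaded once a human approves it.</p>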
<p>That&#8217;s where human intelligence and crowdsourcing come into it.  Recently, some people on our <a href="http://blog.freebase.com/tag/acre/">Acre</a> team have built a series of game-like interfaces to the data team&#8217;s partial evidence store, so you can help us determine the accuracy of the assertions in there.</p>
<p>The first is <a href="http://typewriter.freebaseapps.com/">Typewriter</a>, which asks you to confirm guesses about the Freebase types to assign to topics we got from Wikipedia category pages.</p>
<p><img src="http://blog.freebase.com/wp-content/uploads/2009/03/elshow.png" alt="elshow" title="elshow" width="506" height="244" class="aligncenter size-full wp-image-454" /></p>
<p>Here you can see a topic, &#8220;El Show de las Doce&#8221;, along with a short blurb extracted from the Wikipedia article.  Reading the blurb, you can tell that this is in fact a TV show.  So you vote &#8220;Yes&#8221;, and the assertion is made in Freebase.  Hurrah, more <a href="http://www.freebase.com/view/tv/views/tv_program">TV shows</a>.</p>
<p><img src="http://blog.freebase.com/wp-content/uploads/2009/03/idol.png" alt="idol" title="idol" width="506" height="246" class="aligncenter size-full wp-image-455" /></p>
<p>Here&#8217;s another topic we picked up from the Wikipedia category, <a href="http://en.wikipedia.org/wiki/Category:Idol_television_series">Idol television series</a>.  If you look at that category, you&#8217;ll find that many of the topics linked, like &#8220;Canadian Idol&#8221; or &#8220;Malaysian Idol&#8221;, are television programs, but by no means all of them.  </p>
<p>The topic shown above is actually a book written by an &#8220;Idol&#8221; judge.  Using Typewriter, you would vote &#8220;No&#8221;, because it&#8217;s not a TV program.</p>
<p>That &#8220;no&#8221; vote makes it back to our partial evidence store and becomes a source of statistical evidence: inclusion in the &#8220;Idol television series&#8221; category on Wikipedia isn&#8217;t strong evidence that something is a TV program.  (Conversely, some of the type assertions from Wikipedia categories are very strong: we&#8217;ve found that categories like <a href="http://en.wikipedia.org/wiki/Category:1923_deaths">1923 deaths</a> are extremely strong evidence that someone is a <a href="http://freebase.com/view/people/deceased_person">deceased person</a>.)</p>
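<p>One simple way to turn votes into that kind of statistical evidence is to track the "yes" rate per category. The sketch below assumes votes arrive as (category, is_yes) pairs; that input shape and the function name are illustrative assumptions, not how Typewriter actually stores its data.</p>

```python
from collections import defaultdict


def category_precision(votes):
    """Estimate, per category, the fraction of reviewed topics voted "yes".

    votes is an iterable of (category, is_yes) pairs.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for category, is_yes in votes:
        total[category] += 1
        if is_yes:
            yes[category] += 1
    # Every key in total has at least one vote, so the division is safe.
    return {c: yes[c] / total[c] for c in total}
```

<p>A category with a consistently high rate could then be trusted for automatic loading, while a mixed one keeps going to human review.</p>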
<p>But quite apart from all that heavy theory, <a href="http://typewriter.freebaseapps.com/">Typewriter</a> is just good fun.  There are keyboard shortcuts, an iPhone version for using <strike>in dull meetings</strike> on long commutes, and a leaderboard showing who&#8217;s been the most active player this week and over all time.  You can see from the stats there that we&#8217;ve had 167,012 votes so far, of which 81% (over 135,000) were &#8220;yes&#8221;, meaning we&#8217;ve managed to make that many assertions in the graph.</p>
<p>Take a moment to check it out, and let us know what you think!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2009/03/05/partial-evidence-and-human-intelligence-a-dispatch-from-the-data-mines/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New load of 150,000 TV episodes into Freebase</title>
		<link>http://blog.freebase.com/2009/02/18/new-load-of-150000-tv-episodes-into-freebase/</link>
		<comments>http://blog.freebase.com/2009/02/18/new-load-of-150000-tv-episodes-into-freebase/#comments</comments>
		<pubDate>Thu, 19 Feb 2009 01:03:14 +0000</pubDate>
		<dc:creator>skud</dc:creator>
				<category><![CDATA[Data loads]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=437</guid>
		<description><![CDATA[Freebase Data Team member Al has just completed a huge load of TV episodes into Freebase.  He says:
The first set of data from theTVDB.com &#8212; a creative-commons database for TV &#8212; has finished loading on our production site.
This gave us 150,000 new episodes and 9,000 new seasons across 3,000 series. Some examples:

Mythbusters
Planet Earth
Baywatch
Lazytown

If your [...]]]></description>
			<content:encoded><![CDATA[<p>Freebase Data Team member <a href="http://freebase.com/view/user/alexander">Al</a> has just completed a huge load of TV episodes into Freebase.  He says:</p>
<p>The first set of data from <a href="http://thetvdb.com/">theTVDB.com</a> &#8212; a creative-commons database for TV &#8212; has finished loading on our production site.</p>
<p>This gave us 150,000 new episodes and 9,000 new seasons across 3,000 series. Some examples:</p>
<ul>
<li><a href="http://www.freebase.com/view/en/mythbusters">Mythbusters</a></li>
<li><a href="http://www.freebase.com/view/en/planet_earth">Planet Earth</a></li>
<li><a href="http://www.freebase.com/view/en/baywatch">Baywatch</a></li>
<li><a href="http://www.freebase.com/view/en/lazytown">Lazytown</a></li>
</ul>
<p>If your favorite show doesn&#8217;t look complete, do let me know, but fear not!  In the coming weeks I&#8217;m hoping to make these incremental improvements:</p>
<ul>
<li>improved coverage for popular shows via less conservative reconciliation</li>
<li>episode screenshot images</li>
<li>episode summaries</li>
<li>actors and directors</li>
<li>and regular syncs, to keep up with your favorite shows</li>
</ul>
<p>All this data is, of course, available for apps and mashups via our hosted app development environment <a href="http://freebaseapps.com/">Acre</a>, or using our <a href="http://www.freebase.com/view/guid/9202a8c04000641f8000000008c0e536">REST API</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2009/02/18/new-load-of-150000-tv-episodes-into-freebase/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Meet the Freebase bots</title>
		<link>http://blog.freebase.com/2009/01/06/meet-the-freebase-bots/</link>
		<comments>http://blog.freebase.com/2009/01/06/meet-the-freebase-bots/#comments</comments>
		<pubDate>Tue, 06 Jan 2009 22:31:15 +0000</pubDate>
		<dc:creator>zenkat</dc:creator>
				<category><![CDATA[Data loads]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=391</guid>
		<description><![CDATA[
Do you recognize this man?
Believe it or not, he&#8217;s one of our biggest contributors to Freebase.  He goes by many names that you may have seen on our pages &#8212; mwcl_wikipedia_en, gns_bot,  mw_template_bot, mw_prop_bot, mergebot are just a few of his many aliases.  He works tirelessly day and night.  And, as you might have guessed, [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.freebase.com/api/trans/image_thumb/guid/9202a8c04000641f80000000082ddae8?maxheight=300&amp;mode=fit&amp;maxwidth=150" alt="Bot Image" width="75" height="75" /></p>
<p>Do you recognize this man?</p>
<p>Believe it or not, he&#8217;s one of our biggest contributors to Freebase.  He goes by many names that you may have seen on our pages &#8212; <a href="http://freebase.com/view/user/mwcl_wikipedia_en">mwcl_wikipedia_en</a>, <a href="http://freebase.com/view/user/gns_bot">gns_bot</a>,  <a href="http://freebase.com/view/user/mw_template_bot">mw_template_bot</a>, <a href="http://freebase.com/view/user/mw_prop_bot">mw_prop_bot</a>, <a href="http://freebase.com/view/user/mergebot">mergebot</a> are just a few of his many aliases.  He works tirelessly day and night.  And, as you might have guessed, he&#8217;s not even human.</p>
<p>He&#8217;s Bot &#8212; the collection of all of the automated processes in Freebase&#8217;s data mining pipeline.  Freebase thrives on the content that you, our user community, create.  But we also want to make sure that there&#8217;s a wealth of raw information already there for you to draw upon when you&#8217;re adding your data.  That&#8217;s where Bot comes in.  By automatically gathering, cleaning and formatting data from around the web, Bot provides the Freebase community with a huge amount of data for &#8220;free&#8221;, letting you focus on the specific information your Base, View or Schema requires.</p>
<p>Bot starts his processing with that amazing repository of online information, <a href="http://en.wikipedia.org/">Wikipedia</a>.  Over the past 7 years, Wikipedia has seen remarkable growth.  There are currently 2.5 million articles in Wikipedia, on topics as wide-ranging as people, places, history, politics, music, culture, and science.  There are articles on companies, consumer goods, and pop media &#8212; and the number of articles is <a title="Wikipedia Growth Rates" href="http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%27s_growth">still doubling every 12-18 months</a>.  If a topic is of any importance at all, it&#8217;s a good bet that someone has written an article about it in Wikipedia.</p>
<p>But Wikipedia is written by humans, for humans.  This is great if you need to learn about something, or quickly look up a fact.  But there are things that are difficult to do in Wikipedia.  It&#8217;s hard to ask questions, like &#8220;<a title="Ford Lucas Movies" href="http://www.freebase.com/view/user/zenkat/default_domain/views/ford_$002F_lucas_movies">What movies by George Lucas has Harrison Ford starred in?</a>&#8221; or &#8220;<a title="Final Fantasy" href="http://ffsquare.freebase.com/view/base/ffsquare/views/wall_of_characters">Who are all of the characters in the Final Fantasy franchise?</a>&#8221; or &#8220;<a title="Popes" href="http://popes.freebase.com/view/base/popes/views/popes_of_the_roman_catholic_church">Who are all the Popes of the Roman Catholic Church?</a>&#8221; or &#8220;<a href="http://sanfrancisco.freebase.com/view/base/sanfrancisco/views/tallest_buildings_in_san_francisco">What are the tallest buildings in San Francisco?</a>&#8221;.  It&#8217;s also hard to <a title="ACRE apps" href="http://blog.freebase.com/2008/12/12/featured-acre-app-freebase-sets/">build applications</a> on top of Wikipedia&#8217;s unstructured data.</p>
<p>Bot, however, has a way of extracting information from Wikipedia, known as <a title="WEX downloads" href="http://download.freebase.com/wex/">WEX</a> (short for &#8220;Wikipedia EXtractor&#8221;).  Run on Freebase&#8217;s <a title="Hadoop Core" href="http://download.freebase.com/wex/">Hadoop</a> cluster, WEX is a <a title="MapReduce paper from Google" href="http://labs.google.com/papers/mapreduce.html">MapReduce</a> process that translates Wikipedia&#8217;s markup language into an easy-to-parse XML format.  With this XML format in hand (or actuator, as the case may be), Bot can then identify the articles, redirects, and IDs of the new Wikipedia pages, along with a wealth of associated information like article blurbs, templates, images, and categories.  (As a side note, Freebase makes downloads of our <a title="Free Freebase Downloads" href="http://download.freebase.com/">WEX-parsed files freely available to the community on a quarterly basis</a>.  If you&#8217;re computationally savvy, check it out &#8212; it&#8217;s a treasure trove of raw data.  And if you need processing tools, you may want to check out <a href="http://code.google.com/p/happy/">Happy, our open-source Python extensions to Hadoop</a>.)</p>
<p><a href="http://blog.freebase.com/wp-content/uploads/2008/12/boeing_infobox.jpg"><img class="alignleft alignnone size-medium wp-image-393" src="http://blog.freebase.com/wp-content/uploads/2008/12/boeing_infobox-167x300.jpg" alt="Boeing 777 InfoBox" width="167" height="300" /></a></p>
<p>Bot then goes through and looks for new articles in Wikipedia that weren&#8217;t there the last time he checked.  After filtering out unsuitable articles (lists and the like), he collates the article title, the redirects that point to it, a 1200-character blurb, and an image or two, and creates a stub article.  The article title becomes the name, and the redirects become aliases.  At the end of it all, Bot produces a &#8220;stub article&#8221;, just waiting for a user like you to come along and add it to a base.</p>
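<p>Stub creation along these lines might look roughly like the Python sketch below. The dictionary keys, the function names and the "List of" filter are assumptions made for illustration; only the title-to-name, redirects-to-aliases and 1200-character rules come from the description above.</p>

```python
BLURB_LIMIT = 1200  # character limit for the blurb, as described above


def is_suitable(title):
    """Very rough filter that skips list pages; a real filter would check far more."""
    return not title.startswith("List of ")


def make_stub(title, redirects, text, images):
    """Build a stub topic from one parsed article (field names are hypothetical)."""
    return {
        "name": title,                                # article title becomes the name
        "aliases": sorted(set(redirects) - {title}),  # redirects become aliases
        "blurb": text[:BLURB_LIMIT],
        "images": images[:2],                         # "an image or two"
    }
```

<p>Dropping the title from the alias set avoids listing a topic's own name as one of its aliases.</p>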
<p>But the real value comes from the semistructured data that is available within Wikipedia.  Consider the &#8220;<a title="InfoBoxes" href="http://en.wikipedia.org/wiki/Help:Infobox">InfoBox</a>&#8221;, a feature often seen on Wikipedia pages.  Wikipedians use these templates to organize key information about their articles while giving a common look and feel.</p>
<p><a href="http://blog.freebase.com/wp-content/uploads/2008/12/boeing_template.jpg"><img class="alignright alignnone size-medium wp-image-394" src="http://blog.freebase.com/wp-content/uploads/2008/12/boeing_template-300x148.jpg" alt="Boeing 777 InfoBox Text" width="300" height="148" /></a></p>
<p>If you look behind the scenes at Wikipedia, you&#8217;ll see that this handy summary is created by a bit of structured wikimarkup text, shown on the right.  Bot *loves* data like this.  Here, in nicely-structured, computer-readable text, are a variety of assertions that look very much like Freebase properties.  Not only does it tell us that &#8220;Boeing 777 is an aircraft&#8221; (or, in Freebase terms, &#8220;<a title="Freebase Boeing Topic" href="http://www.freebase.com/view/en/boeing_777">/en/boeing_777</a>&#8221; has a type of &#8220;<a title="aircraft schema model" href="http://www.freebase.com/type/schema/aviation/aircraft_model">/aviation/aircraft_model</a>&#8221;), but it also tells us many of the properties of that type.  We also know that &#8220;Boeing 777&#8221; has a &#8220;manufacturer&#8221; of &#8220;Boeing Commercial Airplanes&#8221; (or, in Freebase-speak, /en/boeing_777 &#8211;&gt; /aviation/aircraft_model/manufacturer &#8211;&gt; <a href="http://www.freebase.com/view/en/boeing_commercial_airplanes">/en/boeing_commercial_airplanes</a>), and that &#8220;Boeing 777&#8221; had a &#8220;maiden flight&#8221; on &#8220;Jun 12, 1994&#8221; (/en/boeing_777 &#8211;&gt; /aviation/aircraft_model/maiden_flight &#8211;&gt; 1994-06-12).</p>
<p>Since WEX parses all of the InfoBox properties (and other templates) within Wikipedia &#8212; all 45 million of them! &#8212; Bot has them at his disposal to create facts in Freebase.  Of course, he needs humans to tell him what the right property is for a given InfoBox &#8212; but we&#8217;ve fed him thousands of these mappings already, which has let him assert 3.9 million high-quality properties based on InfoBoxes.</p>
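<p>To make the idea of a mapping concrete, here is a small Python sketch that turns parsed InfoBox fields into subject/property/value triples. The template name, the field names and the function are assumptions for illustration; only the two property paths are taken from the Boeing 777 example, and this is not Bot's actual code.</p>

```python
# Hand-written mapping from (template, field) to a Freebase property path.
# Template and field names here are hypothetical.
MAPPINGS = {
    ("Infobox Aircraft", "manufacturer"): "/aviation/aircraft_model/manufacturer",
    ("Infobox Aircraft", "first flight"): "/aviation/aircraft_model/maiden_flight",
}


def infobox_to_triples(topic_id, template, fields):
    """Yield (subject, property, value) for every field with a known mapping."""
    for name, value in fields.items():
        prop = MAPPINGS.get((template, name))
        if prop is not None:
            yield (topic_id, prop, value)
```

<p>Fields without a mapping are simply skipped, which is why adding more human-written mappings directly increases the number of facts that can be asserted. The sketch also assumes values have already been resolved to topic IDs or normalized dates.</p>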
<p>Similarly, Wikipedia category pages contain information about the types of the topics listed in the category.  For instance, if you see &#8220;<a href="http://en.wikipedia.org/wiki/San_Francisco_Giants">San Francisco Giants</a>&#8221; on a category page called &#8220;<a title="Wikipedia Category" href="http://en.wikipedia.org/wiki/Category:Baseball_teams_in_California">Baseball teams in California</a>&#8221;, you can be reasonably sure that &#8220;<a href="http://www.freebase.com/view/en/san_francisco_giants">/en/san_francisco_giants</a>&#8221; should be typed as a &#8220;<a title="Baseball teams in Freebase" href="http://www.freebase.com/view/baseball/baseball_team">/baseball/baseball_team</a>&#8221;.  WEX has over 7 million category-article associations like this &#8212; from which Bot has asserted over 1.2M types.</p>
<p>This is just the start.  Here at Freebase, we&#8217;re constantly adding new functionality to Bot.  For instance, there are Bots to make sure that all valid mailing addresses in the US are <a title="Geocoding" href="http://en.wikipedia.org/wiki/Geocoding">geocoded</a> with a valid latitude and longitude, so that they can be visualized on tools like <a title="Golf Course in the USA" href="http://tinyurl.com/64y3dk">Google Maps</a>.  We also have Bots to load data about newly released films, including actor and director information.  There are Bots to clean up messy and invalid data, Bots to process <a title="Freebase Voting Queue" href="http://www.freebase.com/tools/pipeline/showtask">scheduled merges and deletes</a>, and Bots to find new images for topics.  There are even <a title="DeathBot" href="http://blog.freebase.com/2008/05/29/introducing-deathbot/">bots that mark the passing of the dearly departed</a>.  And more is planned &#8212; soon we plan on having Bots that provide you with data about <a title="Books in Freebase" href="http://www.freebase.com/view/book/book">books</a>, <a href="http://www.freebase.com/view/film/film">films</a>, <a title="digicams" href="http://www.freebase.com/view/digicams/views/digital_camera">consumer electronics</a>, and more!</p>
<p>Do you have tasks that you think Bot should be assigned?  Data that he should focus his unwavering attention on?  We&#8217;d love to hear from you &#8212; let us know your thoughts!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2009/01/06/meet-the-freebase-bots/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>100,000 books</title>
		<link>http://blog.freebase.com/2009/01/05/100000-books/</link>
		<comments>http://blog.freebase.com/2009/01/05/100000-books/#comments</comments>
		<pubDate>Mon, 05 Jan 2009 19:34:00 +0000</pubDate>
		<dc:creator>toby</dc:creator>
				<category><![CDATA[Data loads]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=402</guid>
		<description><![CDATA[We&#8217;ve recently added over 100,000 books to Freebase, many with cover art, along with more than 200,000 editions of these books. Rather than load every possible book, we chose a large set of authors who were famous enough to be in Wikipedia and attempted to get all of their books.
The subject data combined with the [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve recently added over 100,000 books to Freebase, many with cover art, along with more than 200,000 editions of these books. Rather than load every possible book, we chose a large set of authors who were famous enough to be in Wikipedia and attempted to get all of their books.</p>
<p>The subject data combined with the cover art makes for very cool galleries using the new topic views. Here are some examples:</p>
<ul>
<li><a href="http://www.freebase.com/view/book/views/books_about_animals" target="_blank">Books about animals</a></li>
<li><a href="http://www.freebase.com/view/book/views/books_by_warren_wiersbe" target="_blank">Books by Warren Wiersbe</a></li>
<li><a href="http://www.freebase.com/view/book/views/cookbooks" target="_blank">Cookbooks</a></li>
<li><a href="http://www.freebase.com/view/book/views/romance_novels_about_vampires" target="_blank">Romance novels about Vampires</a></li>
</ul>
<p>For many of the books, we also have edition data, which includes things like when the edition was published, its ISBN and Library of Congress codes, and the number of pages. This lets you sort the editions by how long they are &#8212; <a href="http://www.freebase.com/view/book/views/book_editions_with_the_most_pages" target="_blank">here are the longest editions in Freebase</a>.</p>
<p>We&#8217;d love to see what collections you can come up with. Leave a comment if you find a set of books that&#8217;s particularly striking.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2009/01/05/100000-books/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Recent data loads: airports, saints, beer, quotations, and more</title>
		<link>http://blog.freebase.com/2008/08/05/recent-data-loads-airports-saints-beer-quotations-and-more/</link>
		<comments>http://blog.freebase.com/2008/08/05/recent-data-loads-airports-saints-beer-quotations-and-more/#comments</comments>
		<pubDate>Tue, 05 Aug 2008 19:54:00 +0000</pubDate>
		<dc:creator>skud</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Data loads]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=276</guid>
		<description><![CDATA[Some of our recent data loads:

1,200 saints
1,900 new airports
2,700 beers from the Oxford bottled beer database
21,310 images from non-English-language Wikipedias
31,212 quotations from QuotationsBook
81,000 US place names from the GNIS database

Mmmm, tasty tasty data!
]]></description>
			<content:encoded><![CDATA[<p>Some of our recent data loads:</p>
<ul>
<li>1,200 <a href="http://www.freebase.com/view/user/xtine/saints/saint">saints</a></li>
<li>1,900 new <a href="http://www.freebase.com/view/aviation/airport">airports</a></li>
<li>2,700 <a href="http://www.freebase.com/view/food/beer">beers</a> from the <a href="http://www.bottledbeer.co.uk/">Oxford bottled beer database</a></li>
<li>21,310 images from non-English-language Wikipedias</li>
<li>31,212 quotations from <a href="http://quotationsbook.com/">QuotationsBook</a></li>
<li>81,000 US place names from the <a href="http://www.freebase.com/view/en/geographic_names_information_system">GNIS</a> database</li>
</ul>
<p>Mmmm, tasty tasty data!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2008/08/05/recent-data-loads-airports-saints-beer-quotations-and-more/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>200k new topics!</title>
		<link>http://blog.freebase.com/2008/06/25/200k-new-topics/</link>
		<comments>http://blog.freebase.com/2008/06/25/200k-new-topics/#comments</comments>
		<pubDate>Thu, 26 Jun 2008 00:32:10 +0000</pubDate>
		<dc:creator>skud</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Data loads]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=226</guid>
		<description><![CDATA[If you&#8217;ve looked at our homepage lately, you&#8217;ll see we&#8217;re heading towards 3.9 million topics, up a couple of hundred thousand since last time we checked in.
I hadn&#8217;t heard about any big new data loads, so I went looking for what had caused this sudden spike in our figures.  Well, it turns out we&#8217;d [...]]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;ve looked at our homepage lately, you&#8217;ll see we&#8217;re heading towards 3.9 million topics, up a couple of hundred thousand since <a href="http://blog.freebase.com/2008/05/23/recent-data-growth/">last time we checked in</a>.</p>
<p>I hadn&#8217;t heard about any big new data loads, so I went looking for what had caused this sudden spike in our figures.  Well, it turns out we&#8217;d had that data all along!  There were a number of topics in the system that weren&#8217;t typed as /common/topic, so they weren&#8217;t showing up in the count.  For example, a bot which had added a number of people (type /people/person) had omitted to co-type them as /common/topic, and that doesn&#8217;t happen automatically when a bot does it, only when done via the UI.</p>
<p><a href="http://freebase.com/view/user/jargonjustin">Justin</a> discovered this the other day, and since it&#8217;s obvious that people are also topics, and so on, he fixed it.  So that&#8217;s how we&#8217;ve got so many &#8220;new&#8221; topics.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2008/06/25/200k-new-topics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Brian Karlak on mining Wikipedia</title>
		<link>http://blog.freebase.com/2008/06/10/brian-karlak-on-mining-wikipedia/</link>
		<comments>http://blog.freebase.com/2008/06/10/brian-karlak-on-mining-wikipedia/#comments</comments>
		<pubDate>Tue, 10 Jun 2008 23:44:25 +0000</pubDate>
		<dc:creator>skud</dc:creator>
				<category><![CDATA[Data loads]]></category>
		<category><![CDATA[Events]]></category>

		<guid isPermaLink="false">http://blog.freebase.com/?p=209</guid>
		<description><![CDATA[Here&#8217;s a video from our last user group meeting.  Brian Karlak, one of our data team, tells us how Freebase mines Wikipedia for structured data.

Since you can&#8217;t see the slides very well because of the lighting in the video, here is his PowerPoint deck uploaded to slideshare.net:


Don&#8217;t forget [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a video from our last user group meeting.  Brian Karlak, one of our data team, tells us how Freebase mines Wikipedia for structured data.</p>
<p><object type="application/x-shockwave-flash" data="http://blip.tv/scripts/flash/showplayer.swf?enablejs=true&#038;feedurl=http%3A%2F%2Ffreebase%2Eblip%2Etv%2Frss&#038;file=http%3A%2F%2Fblip%2Etv%2Frss%2Fflash%2F987501%3Freferrer%3Dblip%2Etv%26source%3D1&#038;showplayerpath=http%3A%2F%2Fblip%2Etv%2Fscripts%2Fflash%2Fshowplayer%2Eswf" width="400" height="255" allowfullscreen="true" id="showplayer"><param name="movie" value="http://blip.tv/scripts/flash/showplayer.swf?enablejs=true&#038;feedurl=http%3A%2F%2Ffreebase%2Eblip%2Etv%2Frss&#038;file=http%3A%2F%2Fblip%2Etv%2Frss%2Fflash%2F987501%3Freferrer%3Dblip%2Etv%26source%3D1&#038;showplayerpath=http%3A%2F%2Fblip%2Etv%2Fscripts%2Fflash%2Fshowplayer%2Eswf" /><param name="quality" value="best" /><embed src="http://blip.tv/scripts/flash/showplayer.swf?enablejs=true&#038;feedurl=http%3A%2F%2Ffreebase%2Eblip%2Etv%2Frss&#038;file=http%3A%2F%2Fblip%2Etv%2Frss%2Fflash%2F987501%3Freferrer%3Dblip%2Etv%26source%3D1&#038;showplayerpath=http%3A%2F%2Fblip%2Etv%2Fscripts%2Fflash%2Fshowplayer%2Eswf" quality="best" width="400" height="255" name="showplayer" type="application/x-shockwave-flash"></embed></object></p>
<p>Since you can&#8217;t see the slides very well because of the lighting in the video, here is his PowerPoint deck uploaded to slideshare.net:</p>
<div style="width:425px;text-align:left" id="__ss_433705"><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=freebasewikipedia20080416web-1212010398759168-9"/><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slideshare.net/swf/ssplayer2.swf?doc=freebasewikipedia20080416web-1212010398759168-9" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object>
<div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;"><a href="http://www.slideshare.net/?src=embed"><img src="http://static.slideshare.net/swf/logo_embd.png" style="border:0px none;margin-bottom:-5px" alt="SlideShare"/></a> | <a href="http://www.slideshare.net/zenkat/freebase-wikipedia-mining-20080416?src=embed" title="View Freebase: Wikipedia Mining 20080416 on SlideShare">View</a> | <a href="http://www.slideshare.net/upload?src=embed">Upload your own</a></div>
</div>
<p>Don&#8217;t forget there&#8217;s a <a href="http://blog.freebase.com/2008/06/04/freebase-user-group-meeting-tuesday-june-17th/">Freebase User Group meeting</a> next Tuesday, June 17th, in San Francisco.  If you&#8217;re in the Bay Area you should definitely try to make it!  <a href="http://upcoming.yahoo.com/event/760574/">RSVP at upcoming.org</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.freebase.com/2008/06/10/brian-karlak-on-mining-wikipedia/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
