Partial evidence and human intelligence: a dispatch from the data mines

I’ve been dying to write about this for ages, and I hope I can do it justice. It’s a hairy subject, but one that’s very close to all our hearts here at Freebase.

People sometimes ask us, “Why haven’t you scraped infobox X from Wikipedia?” or “Why don’t you just load all the data from site Y?” The short answer is that, in many cases, we have fetched that data. And yet we haven’t loaded it into Freebase.

Why not?

Well, information we gather en masse is seldom 100% correct. For example, we might extract a list of fashion designers from Wikipedia’s category of the same name, intending to assert the fact Profession: fashion designer on the relevant Person topics. But what if some fashion houses or labels had snuck into the list of fashion designers? We don’t want to accidentally type those companies or brands as people.

In general, when our data team collects information like this, they don’t assert it in Freebase’s graph right away. They treat it as partial evidence and put it in a partial evidence store. To get out of that store and into Freebase, one of three things has to happen:

  • it is confirmed by one or more other sources,
  • it is reviewed by a human being, or
  • the original source is proven sufficiently accurate by statistical methods.
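
To make those three routes concrete, here is a minimal sketch in Python. It is not Freebase’s actual code: the `Evidence` class, the `ready_to_load` function and the 0.95 accuracy threshold are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field

# Assumed threshold: the post does not say how accurate a source must be.
MIN_SOURCE_ACCURACY = 0.95


@dataclass
class Evidence:
    """One candidate assertion sitting in a (hypothetical) partial evidence store."""
    topic: str
    prop: str
    value: str
    sources: set = field(default_factory=set)  # every source that makes this claim
    human_approved: bool = False               # a reviewer has confirmed it


def ready_to_load(ev, source_accuracy):
    """Return True if any one of the three promotion routes is satisfied.

    source_accuracy maps a source name to its measured accuracy (0.0 to 1.0);
    sources with no measurement are treated as unproven.
    """
    # Route 1: the original source plus at least one other source agree.
    confirmed = len(ev.sources) >= 2
    # Route 2: a human being has reviewed the assertion.
    reviewed = ev.human_approved
    # Route 3: at least one source has been shown to be accurate enough.
    trusted = any(
        source_accuracy.get(s, 0.0) >= MIN_SOURCE_ACCURACY for s in ev.sources
    )
    return confirmed or reviewed or trusted
```

An assertion backed by a single unmeasured source stays in the store until a second source, a human vote or a good accuracy score for that source comes along.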

That’s where human intelligence and crowdsourcing come into it. Recently, some people on our Acre team have built a series of game-like interfaces to the data team’s partial evidence store, so you can help us determine the accuracy of the assertions in there.

The first is Typewriter, which asks you to confirm guesses about the Freebase types to assign to topics we got from Wikipedia category pages.

[Screenshot: Typewriter showing the topic “El Show de las Doce”]

Here you can see a topic, “El Show de las Doce”, along with a short blurb extracted from the Wikipedia article. Reading the blurb, you can tell that this is in fact a TV show. So you vote “Yes”, and the assertion is made in Freebase. Hurrah, more TV shows.

[Screenshot: Typewriter showing a topic from the “Idol television series” category]

Here’s another topic we picked up from the Wikipedia category, Idol television series. If you look at that category, you’ll find that many of the topics linked, like “Canadian Idol” or “Malaysian Idol”, are television programs, but by no means all of them.

The topic shown above is actually a book written by an “Idol” judge. Using Typewriter, you would vote “No”, because it’s not a TV program.

That “no” vote makes it back to our partial evidence store and becomes a source of statistical evidence: inclusion in the “Idol television series” category on Wikipedia isn’t strong evidence that something is a TV program. (Conversely, some of the type assertions from Wikipedia categories are very strong: we’ve found that categories like 1923 deaths are extremely strong evidence that someone is a deceased person.)
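
As a rough illustration of how votes could turn into that kind of evidence, the sketch below tallies “yes” and “no” votes per Wikipedia category and estimates how often membership in the category leads to a confirmed type. The function name `category_strength` and the add-one smoothing are assumptions made for this example, not a description of what the data team actually runs.

```python
from collections import Counter


def category_strength(votes, prior_yes=1, prior_no=1):
    """Estimate, per category, the share of its topics that voters confirmed.

    votes is an iterable of (category, is_yes) pairs.
    The prior counts are an assumed smoothing choice so that a category
    with only a handful of votes is not scored as exactly 0 or 1.
    """
    yes = Counter()
    total = Counter()
    for category, is_yes in votes:
        total[category] += 1
        if is_yes:
            yes[category] += 1
    return {
        c: (yes[c] + prior_yes) / (total[c] + prior_yes + prior_no)
        for c in total
    }
```

A category scoring close to 1 could then be trusted with little or no review, while one with a middling score would keep sending its topics to human voters.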

But quite apart from all that heavy theory, Typewriter is just good fun. There are keyboard shortcuts, an iPhone version for use in dull meetings or on long commutes, and a leaderboard showing who’s been the most active player this week and over all time. You can see from the stats there that we’ve had 167,012 votes so far, of which 81% (over 135,000) were “yes”, meaning we’ve managed to make that many assertions in the graph.

Take a moment to check it out, and let us know what you think!


About

Freebase is a free database of the world's information. This is the official Freebase blog.