New API service: Reconciliation
[This is a guest post by Peter, who's just released this new reconciliation service.]
Suppose you’ve compiled a detailed database of sneezing in film, tracking the character, location, and cause of each sneeze (illness, allergies, etc.). Naturally, you’d like to mash up your data with info in Freebase. What production companies are brave enough to push the limits of sneezing cinema? Do actor-directors tend to give sneezing parts to their own characters? These questions burn to be answered!
The first step in creating a mash-up is to determine how your entities (in this case, movies) match against entities in Freebase. We call this the reconciliation problem, and fortunately there are a couple of ways to tackle it.
The best option, when available, is to use a key. Keys are properties that uniquely identify an entity, like the Freebase GUID for each topic. It’s unlikely that you’ll be lucky enough to have Freebase GUIDs for your data, but there are domain-specific keys that you can use. For example, movies have Netflix IDs and IMDB profile pages, books have ISBNs, commercial products have UPCs, and just about everything in modern music has a MusicBrainz ID.
Reconciling with keys is easy. If Freebase has the key, a single MQL query will give you the movie in question. For example, if you have the movie Plan 9 from Outer Space in your database, you could find it in Freebase by its Freebase ID, Wikipedia article name, Wikipedia curid, or imdb_url.
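As a rough sketch of what such a key lookup might look like, the Python snippet below builds an MQL read query that asks for a film by a key in a given namespace. The helper name build_key_query, the /wikipedia/en namespace, and the key value are assumptions chosen for illustration, so check which namespaces and key formats Freebase actually uses for your data before relying on them.

```python
import json


def build_key_query(namespace, value, type_id="/film/film"):
    """Build an MQL read query that looks up one topic by a key.

    The helper name and its default type are illustrative only.
    """
    return {
        "id": None,    # null asks MQL to fill in the topic's id
        "name": None,  # null asks MQL to fill in the display name
        "type": type_id,
        "key": {"namespace": namespace, "value": value},
    }


# Assumed namespace and key value, for illustration only.
query = build_key_query("/wikipedia/en", "Plan_9_from_Outer_Space")

# Wrap the query in an envelope and serialize it as JSON text.
envelope = json.dumps({"query": query})
print(envelope)
```

If the key exists, the response should fill in the id and name fields that were left as null; if it doesn't, you're back to reconciling by name.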
But if you don’t have a key, how can you reconcile your entities? Fortunately, we’ve come up with an experimental new web API service that exposes some of our reconciliation techniques and makes them easy to use. The service takes a name and a list of Freebase types, and then returns a list of the Freebase topics that are the closest match. Because the service is intended for use with automated reconciliation tools, it also returns a recommendation for how to treat the results.
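To make the inputs concrete, here is a minimal sketch of how a client might assemble such a request. The endpoint URL and the parameter names name and type are placeholders assumed for illustration; the technical documentation is the authority on the real ones.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: a placeholder, not the service's real address.
RECONCILE_URL = "https://example.invalid/reconcile"


def build_reconcile_url(name, types, base_url=RECONCILE_URL):
    """Return a GET URL carrying a name and one or more Freebase types.

    The parameter names "name" and "type" are assumptions.
    """
    # Repeat the "type" parameter once per Freebase type.
    params = [("name", name)] + [("type", t) for t in types]
    return base_url + "?" + urlencode(params)


url = build_reconcile_url("The Pumaman", ["/film/film"])
print(url)
```

Passing the types as a list keeps the door open for entities that could plausibly belong to more than one Freebase type.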
As an example, let’s try to reconcile something simple, like the well-known classic movie The Pumaman. If you look at the query you’ll see some JSON results that begin like this:
{
  "recommendation": "automerge",
  "results": [
    {
      "score": {
        "aggregate": 1.4,
        "type": 1.0,
        "name": 1.0
      },
      "id": "\/topic\/en\/the_pumaman",
      "name": "The Pumaman"
    },
    <snip...>
  ]
}
In the results, the service will make one of three recommendations: “automerge,” “autocreate,” or “suggest.” The “automerge” result is ideal. It means that your program is free to treat the first result as the entity that you’re looking for, and you can add or retrieve whatever information you want about /topic/en/the_pumaman. A recommendation of “autocreate” is also good, and means that your program can automatically create a new topic because there’s nothing in Freebase that matches your query.
The “suggest” result is returned when the reconciliation service can’t give a conclusive answer, perhaps because there’s not enough information. For example, there are over fifty John Smiths in Freebase, and more than a dozen films named A Christmas Carol (not including The Muppet Christmas Carol, or the planned 3D adaptation with Jim Carrey).
When your program gets a “suggest” recommendation, it’s an indication that additional analysis, and perhaps human input, is needed. For movies, a simple heuristic based around the year the movie was released is usually enough to resolve reconciliation problems.
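The decision logic described above can be sketched roughly as follows. This is one possible way to act on the three recommendations, not the service's own code: the function name choose_topic, the candidate_years mapping (topic id to release year, which you would look up separately), and the topic ids in the usage example are all hypothetical.

```python
def choose_topic(response, candidate_years=None, release_year=None):
    """Decide what to do with a reconciliation response.

    Returns an (action, topic_id) pair, where action is "merge",
    "create", or "review". candidate_years is a hypothetical mapping
    from topic id to release year supplied by the caller.
    """
    recommendation = response["recommendation"]
    results = response.get("results", [])

    if recommendation == "automerge" and results:
        # Trust the first result as the matching topic.
        return ("merge", results[0]["id"])
    if recommendation == "autocreate":
        # Nothing in Freebase matches, so a new topic can be created.
        return ("create", None)

    # "suggest" (or anything unexpected): try the release-year heuristic,
    # and accept a candidate only if exactly one has the right year.
    if release_year is not None and candidate_years:
        matches = [r["id"] for r in results
                   if candidate_years.get(r["id"]) == release_year]
        if len(matches) == 1:
            return ("merge", matches[0])

    # Still ambiguous: hand the case to further analysis or a person.
    return ("review", None)


pumaman = {
    "recommendation": "automerge",
    "results": [{"id": "/topic/en/the_pumaman", "name": "The Pumaman"}],
}
print(choose_topic(pumaman))

# Hypothetical ambiguous response; the ids and years are made up.
ambiguous = {
    "recommendation": "suggest",
    "results": [{"id": "/example/film_a"}, {"id": "/example/film_b"}],
}
years = {"/example/film_a": 1938, "/example/film_b": 1984}
print(choose_topic(ambiguous, candidate_years=years, release_year=1984))
```

Requiring exactly one year match is a deliberately cautious choice: if two candidates share a release year, the sketch falls back to review rather than guessing.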
That’s the quick overview of the reconciliation service. The full technical documentation is here (with more examples). Feedback is welcome on the Freebase Developers’ mailing list!