Gender and names in Freebase

If you’ve looked at the Genderizer queue lately, you’ll have noticed a lot of Person topics that have a name but no description or picture. That makes it hard to guess their gender, and you’re left relying on the given name alone. It seemed to me that a computer could do this just as well as a human, so I asked our data team what they thought.

Brian Karlak looked into it and told me:

We have ~600K people in Freebase with both names and genders.

I took the first name as the first space-separated token in the /type/object/name of these people. Dreadfully simplistic, I know, but it will do for this analysis. It does mean that “first names” will include such honorifics as “King”, “Princess” and “Dr.”, but that’s OK — some correlate very well with gender.

I looked for first names with at least 100 exemplars. Of those, 693 first names correlated >99% with a particular gender, and 507 showed 100% correlation with a particular gender. Over 80% (412) of those 507 were male names.

There were 127 first names that had at least 100 exemplars but did not correlate consistently with a single gender. The most gender-ambiguous name was Andrea (nearly 50/50 split), followed by Ashley (48/52), Dana (53/47), Nicola (47/53), and Charlie (47/53).
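For anyone who wants to try something similar on their own data, here is a minimal Python sketch of the kind of analysis described above. It is not Brian's actual code: the function names, the `(name, gender)` input format, and the toy data are assumptions made for illustration. It takes the first space-separated token of each name, keeps tokens with enough exemplars, and reports how strongly each one leans toward its most common gender.

```python
from collections import Counter, defaultdict


def first_token(full_name):
    """Return the first space-separated token of a name, or None if there is none."""
    parts = full_name.split()
    return parts[0] if parts else None


def gender_split_by_first_name(people, min_exemplars=100):
    """Map each common first token to (total, most_common_gender, share).

    `people` is an iterable of (full_name, gender) pairs; this input
    format is an assumption, not Freebase's actual schema.
    """
    counts = defaultdict(Counter)
    for full_name, gender in people:
        token = first_token(full_name)
        if token is not None:
            counts[token][gender] += 1

    result = {}
    for token, by_gender in counts.items():
        total = sum(by_gender.values())
        if total < min_exemplars:
            continue  # too few exemplars to say anything useful
        top_gender, top_count = by_gender.most_common(1)[0]
        result[token] = (total, top_gender, top_count / total)
    return result


def confident_names(split, threshold=0.99):
    """Keep only tokens whose leading gender share is above the threshold."""
    return {token: gender
            for token, (_, gender, share) in split.items()
            if share > threshold}


if __name__ == "__main__":
    # Made-up toy data; the exemplar cutoff is lowered so it has something to show.
    toy = [("John Example", "Male")] * 3 + [("Andrea Sample", "Female")] * 2 \
        + [("Andrea Other", "Male")] * 2
    split = gender_split_by_first_name(toy, min_exemplars=2)
    for token, (total, gender, share) in sorted(split.items()):
        print(token, total, gender, round(share, 2))
    print(confident_names(split))
```

As in the description above, honorifics such as “King” or “Dr.” would be counted as first names by this sketch, since it does nothing smarter than splitting on whitespace.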

Brian tells me we’ll be able to use this to assert the genders of people in Freebase, so you don’t have to.

If you’re interested in gendered names in Freebase, I recently created an app to do more or less what Brian is doing above. It’s called Gendered Names and if you tell it a given name, it will tell you the split between male and female based on what’s in Freebase and present it as a nice chart. Here’s what it thinks about the name “Evelyn” for example:

Of course, as with any Acre app, you can view the source and clone it if you’d like to build something similar.

Slides from the NYC semweb meetup

Last week Robert and Jamie were in New York, where they gave a presentation about Freebase at the NYC semantic web meetup. Their slides are now up on Slideshare.net:

David Tunkelang, who was at the meetup, has also blogged about it.

Freebase at Wikimania

Just a quick heads-up: I’m at Wikimania this week in Buenos Aires, meeting a bunch of fabulous Wikimedians and hopefully attending some great presentations.

If you’re here too, please look out for me and say hi! My nametag says: “Kirrily Robert [[User:Skud]] Freebase” and I have a bunch of Freebase stickers and buttons to give away.

Acre stats: 1000 apps, and growing

Stefano’s put together an Acre app (what else?) to show some interesting stats and information about new Acre apps: acre-development.freebaseapps.com. Looking at it yesterday, we noticed that we’d just passed the 1000 apps mark!

Some other interesting tools that are part of the acre-development app include:

As usual, the source code for the app is there for you to read or clone.

Freebase now has 8.4 million topics

Remember not so long ago when we were very excited about crossing the 5 million mark? Well, in the last couple of weeks, we suddenly raced past 6, 7, and 8 million topics in Freebase, to reach a current total of 8,450,348 topics.

This is largely thanks to two big loads from the data team: first, a massive import of millions of books and related information from the Open Library Project, and second, a load of around 255k TV episodes from TVRage.

Here’s a chart showing topic growth in Freebase since the beginning of April:

Topic growth in Freebase, April to August 2009

To take a look at the relevant data, visit our Publishing Commons or TV Commons. For some apps taking advantage of all this data, look at Tippify for book recommendations, or Alex’s TV4Me app (currently in development), a mashup of Freebase’s TV data with US broadcast schedules.

About

Freebase is a free database of the world's information. This is the official Freebase blog.