Gender and names in Freebase

If you’ve looked at the Genderizer queue lately, you’ll have noticed a lot of Person topics that have a name but no description or picture. That makes it hard to guess their gender, and you’re left relying on the given name alone. It seemed to me that a computer could do this just as well as a human, so I asked our data team what they thought.

Brian Karlak looked into it and told me:

We have ~600K people in Freebase with both names and genders.

I took the first name as the first space-separated token in the /type/object/name of these people. Dreadfully simplistic, I know, but it will do for this analysis. It does mean that “first names” will include such honorifics as “King”, “Princess” and “Dr.”, but that’s OK — some correlate very well with gender.

I looked for first names with at least 100 exemplars. Of those, there were 693 first names that correlated >99% with a particular gender. There were 507 that showed 100% correlation with a particular gender. Over 80% (412) of these were male names.

There were 127 first names that had at least 100 exemplars but did not correlate consistently with a single gender. The most gender-ambiguous name was Andrea (nearly 50/50 split), followed by Ashley (48/52), Dana (53/47), Nicola (47/53), and Charlie (47/53).
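
For the curious, the steps Brian describes can be sketched in a few lines of Python. This is only an illustration, not the code the data team ran: the function names and the toy `people` list are made up, and the real analysis covered ~600K topics with a minimum of 100 exemplars per name.

```python
from collections import Counter, defaultdict


def first_token(name):
    """Return the first space-separated token of a name."""
    parts = name.split()
    return parts[0] if parts else ""


def gender_split(people):
    """Map each first token to a Counter of the genders seen for it.

    `people` is a hypothetical list of (name, gender) pairs standing in
    for Person topics that have both a name and a gender.
    """
    counts = defaultdict(Counter)
    for name, gender in people:
        token = first_token(name)
        if token:
            counts[token][gender] += 1
    return counts


def classify(counts, min_exemplars=100, threshold=0.99):
    """Split first tokens into strongly gendered and ambiguous groups.

    Tokens with fewer than `min_exemplars` people are skipped. A token
    counts as gendered when its most common gender's share exceeds
    `threshold`.
    """
    gendered, ambiguous = {}, {}
    for token, by_gender in counts.items():
        total = sum(by_gender.values())
        if total < min_exemplars:
            continue
        gender, top = by_gender.most_common(1)[0]
        share = top / total
        if share > threshold:
            gendered[token] = (gender, share)
        else:
            ambiguous[token] = (gender, share)
    return gendered, ambiguous


# Toy data only; the threshold is lowered so four people are enough.
people = [
    ("King Arthur", "Male"),
    ("King Lear", "Male"),
    ("Andrea Bocelli", "Male"),
    ("Andrea Corr", "Female"),
]
gendered, ambiguous = classify(gender_split(people), min_exemplars=2)
print(gendered)
print(ambiguous)
```

With this toy input, “King” lands in the gendered group and “Andrea” in the ambiguous one, which mirrors the honorific and 50/50 cases mentioned above.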

Brian tells me we’ll be able to use this to assert the genders of people in Freebase, so you don’t have to.
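
Roughly, that could work like the sketch below: look up a person's first token in a table of strongly gendered names and assert a gender only when the name is in the table and the correlation is high enough. The `guess_gender` helper, the table contents and the share values are invented for illustration, and this is not a description of the actual job.

```python
def guess_gender(name, gendered_names, min_share=0.99):
    """Return a gender for `name`, or None when we should not guess.

    `gendered_names` is a hypothetical dict mapping a first token to a
    (gender, share) pair, where share is the fraction of known people
    with that token who have that gender.
    """
    parts = name.split()
    if not parts:
        return None
    entry = gendered_names.get(parts[0])
    if entry is None:
        return None
    gender, share = entry
    return gender if share > min_share else None


# Made-up table; the numbers are placeholders, not Freebase results.
table = {"Princess": ("Female", 1.0), "Andrea": ("Female", 0.5)}

print(guess_gender("Princess Example", table))  # Female
print(guess_gender("Andrea Example", table))    # None
print(guess_gender("Zaphod Example", table))    # None
```

Returning `None` for ambiguous or unseen names leaves those topics in the queue for a human to decide.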

If you’re interested in gendered names in Freebase, I recently created an app that does more or less what Brian describes above. It’s called Gendered Names: give it a given name and it shows you the split between male and female for that name in Freebase as a chart. Here’s what it thinks about the name “Evelyn”, for example:

Of course, as with any Acre app, you can view the source and clone it if you’d like to build something similar.


2 Responses to “Gender and names in Freebase”

  1. Laurens Holst Says:

    Hmm, I would be careful though.

    For example, the name ‘Anne’ in English is pretty much always a female name. However in Dutch it is also a male name. So even though for 99% of the world population it may be pretty unambiguously a female name, in the context of a specific language or country it could be either a male or a female name.

    So if there is regional information for a topic, it would be good to take this into account to improve the accuracy of this method. Maybe first do a project to derive regional information? :) Based on keywords and presence/article size in regional wikis you could get pretty far, too ;p.

    I looked in the Gendered Names app, and it looks like Anne is male in 3% of the cases on Freebase. Of the examples given, most cases seem to be French, so I guess the Dutch Annes are ‘lucky’ that a bigger country also has Anne as a male name.

    ~Laurens

  2. Laurens Holst Says:

About

Freebase is a free database of the world's information. This is the official Freebase blog.