
Vocabulary that I am using:

nounphrase -- A short phrase that refers to a specific person, place, or idea. Examples of different nounphrases include "Barack Obama", "Obama", "Water Bottle", "Yellowstone National Park", "Google Chrome web browser", etc.

category -- A semantic concept that defines which nounphrases belong to it and which do not. Examples of categories include "Politician", "Household items", "Food", "People", "Sports teams", etc. So "Barack Obama" would belong to "Politician" and "People" but not to "Food" or "Sports teams".

I have a very large unlabeled NLP dataset consisting of millions of nounphrases. I would like to use Freebase to label these nounphrases. I have a mapping of Freebase types to my own categories. What I need to do is download every example of every Freebase type that I have.
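For concreteness, the mapping looks roughly like this (the type IDs and category names here are just illustrative):

freebase_type_to_category = {
    "/government/politician": "Politician",
    "/food/food": "Food",
    "/sports/sports_team": "Sports teams",
}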

The problem I face is that I need to figure out how to structure this kind of query. At a high level, the query should ask Freebase "what are all of the examples of topic XX?" and Freebase should respond with "here's a list of all examples of topic XX." I would be very grateful if someone could give me the syntax of this query. If it can be done in Python, that would be awesome :)

Malcolm

2 Answers


The basic form of the query (for a person, for example) is

[{
  "type":"/people/person",
  "name":None,
  "/common/topic/alias":[],
  "limit":100
}]
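If you just want to try that query once from Python, something like the following should work (a sketch using the freebase-python library linked below; as I recall, mqlread submits a single read and returns up to "limit" results):

import freebase

query = [{
  "type":"/people/person",
  "name":None,
  "/common/topic/alias":[],
  "limit":100
}]

# One call returns at most "limit" results; use mqlreaditer (shown below) to get everything
for person in freebase.mqlread(query):
    print person["name"], person["/common/topic/alias"]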

There's documentation available at http://wiki.freebase.com/wiki/MQL_Manual

Using freebase.mqlreaditer() from the Python library http://code.google.com/p/freebase-python/ is the easiest way to cycle through all of these. In this case, the "limit" clause determines the chunk size used for querying, but you'll get each result individually at the API level.

BTW, how do you plan to disambiguate Jack Kennedy the president from the hurler, the football player, the book, etc.? See http://www.freebase.com/search?limit=30&start=0&query=jack+kennedy You may want to consider capturing additional information from Freebase (birth & death dates, book authors, other types assigned, etc.) if you'll have enough context to be able to use it to disambiguate.
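For example, the query can request extra context alongside each name. A rough sketch (the property paths are just illustrative, and, as I recall, properties requested as lists with [] are optional in MQL, so topics missing them aren't filtered out):

[{
  "type":"/people/person",
  "mid":None,
  "name":None,
  "/people/person/date_of_birth":[],
  "/common/topic/alias":[],
  "limit":100
}]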

Past a certain point, it may be easier and/or more efficient to work from the bulk data dumps rather than the API: http://wiki.freebase.com/wiki/Data_dumps
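If you go the dump route, the basic idea is to scan for /type/object/type assertions and collect every topic that carries one of your types. A rough sketch, assuming the tab-separated quad layout (source, property, destination, value) the dumps used at the time -- check the wiki page for the exact format, and note that the file name below is just a placeholder:

import gzip

# Freebase type IDs you care about, one per line (same types.txt as below)
wanted = set(line.strip() for line in open('types.txt'))

instances = {}
dump = gzip.open('freebase-datadump-quadruples.tsv.gz')  # placeholder name
for line in dump:
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 3:
        continue
    source, prop, dest = fields[0], fields[1], fields[2]
    # /type/object/type links a topic to each type assigned to it
    if prop == '/type/object/type' and dest in wanted:
        instances.setdefault(dest, []).append(source)
dump.close()

for t in instances:
    print t, len(instances[t])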

Edit - here's a working Python program which assumes you've got a list of type IDs in a file called 'types.txt':

import freebase

# One Freebase type ID per line, e.g. /people/person
f = open('types.txt')
for t in f:
    t = t.strip()
    q = [{'type':t,
          'mid':None,
          'name':None,
          '/common/topic/alias':[],
          'limit':500,
          }]
    # mqlreaditer pages through the full result set, one topic at a time
    for r in freebase.mqlreaditer(q):
        # some topics have no name, so substitute '' to keep join() happy
        print '\t'.join([t, r['mid'], r['name'] or ''] + r['/common/topic/alias'])
f.close()

If you make the query much more complex, you'll probably want to lower the limit to keep from running into timeouts, but for a simple query like this, boosting the limit above the default of 100 will make it more efficient by querying in bigger chunks.

Tom Morris
  • Thank you very much Tom! I ended up using the data dumps. The Python code is, however, extremely useful as I very much wanted to know how to grab instances from Freebase. I also appreciate your comment on disambiguation. Currently, a PhD student in my research group is focusing on this disambiguation problem from a machine learning perspective. It would be interesting to see if he could use Freebase to augment his current approach. – Malcolm Nov 14 '11 at 18:59

The general problem described here is called Entity Linking in natural language processing.

Unabashed self plug:

See our book chapter on the topic for an introduction and an approach to perform large scale entity linking.

http://cs.jhu.edu/~delip/entity_linking.pdf

@deliprao

Delip