
I've recently started looking at data extraction using NLTK. While there are several examples and techniques for detecting "real" names, locations, etc., I haven't found an efficient way to detect "made up" or "imaginary" names. An example string would be:

His name is wuzzywugg and he has a dog named fizzbuzz

I would like to train NLTK to detect that "wuzzywugg" and "fizzbuzz" are names of characters. I've seen some solutions that rely on the word starting with a capital letter, but this feels very "hacky" and prone to errors and false positives.

Any help on how to solve this issue would be greatly appreciated. Thanks in advance.

django-d
  • Named entity recognizers rely on a variety of clues (usually including capitalization) to decide which kind of named entity (if any) they are looking at. If you don't care to _distinguish_ actual from made up names, this should work well enough for you. – alexis Apr 30 '17 at 21:06

1 Answer


I ran into the same problem when processing Russian folktales. It turns out that most of their names don't appear in Western gazetteers. A quick approach may be to use part-of-speech tags and keep only the tokens tagged NNP (proper nouns). Check this: http://www.nltk.org/book/ch05.html
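A minimal sketch of that quick approach, assuming NLTK is installed and its default tagger model has been downloaded. The helper names `nltk_tagger` and `proper_nouns` are made up for this example; `nltk.pos_tag` is the only NLTK call used. The tagger is passed in as a parameter so the filtering step can be tried without NLTK.

```python
from typing import Callable, Iterable, List, Tuple

TaggedToken = Tuple[str, str]


def nltk_tagger(tokens: List[str]) -> List[TaggedToken]:
    """Tag tokens with NLTK's default tagger (needs its model data downloaded)."""
    import nltk  # imported lazily so the filter below also works without NLTK

    return nltk.pos_tag(tokens)


def proper_nouns(
    tokens: Iterable[str],
    tagger: Callable[[List[str]], List[TaggedToken]] = nltk_tagger,
) -> List[str]:
    """Return the tokens whose Penn Treebank tag is NNP or NNPS."""
    tagged = tagger(list(tokens))
    return [word for word, tag in tagged if tag in ("NNP", "NNPS")]


# Example call (output depends on the tagger model, so none is shown):
# proper_nouns("His name is Wuzzywugg and he has a dog named Fizzbuzz".split())
```

Taggers lean on capitalization among other clues, so an all-lowercase invented name such as "wuzzywugg" may well come back tagged as a common noun and be missed by this filter.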

This didn't work entirely for me. My approach involved extracting all noun phrases (NP nodes from the parse tree) and then extracting feature vectors, which I annotated myself, to build an ML classifier. You can find more information here: http://ieeexplore.ieee.org/document/7489041/
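To give an idea of what such feature vectors could look like, here is a simplified sketch. It is not the feature set from the paper: it works on single tokens rather than NP nodes, and the function name `candidate_features`, the feature names, and the `known_words` vocabulary are all assumptions chosen for illustration. The idea is that context ("name is", "named") and being out of vocabulary can hint at a name even without capitalization.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, bool]


def candidate_features(
    tokens: List[str], i: int, known_words: Set[str]
) -> Dict[str, Feature]:
    """Build a feature dict for the token at position i (hypothetical features)."""
    word = tokens[i]
    # Lowercased left context, padded with a sentence-start marker.
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    prev2 = tokens[i - 2].lower() if i > 1 else "<s>"
    return {
        "prev_word": prev,                              # e.g. "named"
        "prev_bigram": prev2 + " " + prev,              # e.g. "name is"
        "is_capitalized": word[:1].isupper(),           # one clue among several
        "in_vocabulary": word.lower() in known_words,   # invented names usually are not
        "suffix2": word[-2:].lower(),
    }


tokens = "His name is wuzzywugg and he has a dog named fizzbuzz".split()
vocabulary = {"his", "name", "is", "and", "he", "has", "a", "dog", "named"}
features = candidate_features(tokens, 3, vocabulary)
```

Dicts of this shape, paired with a label you annotate yourself (for example `(features, "NAME")`), are the input format that `nltk.NaiveBayesClassifier.train` accepts, so they could feed a simple classifier.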

Josep Valls