Possible to train NLTK to detect "made up" names in a sentence?

Question

I've recently started looking at data extraction using NLTK. While there are several examples and techniques for detecting "real" names, locations, etc., I haven't found an efficient way to detect "made up" or "imaginary" names. An example string would be:

His name is wuzzywugg and he has a dog named fizzbuzz

I would like to train NLTK to detect that "wuzzywugg" and "fizzbuzz" are names of characters. I've seen some solutions that rely on the word starting with a capital letter, but this feels very "hacky" and prone to errors and false positives.

Any help on how to solve this issue would be greatly appreciated. Thanks in advance.

Named entity recognizers rely on a variety of clues (usually including capitalization) to decide which kind of named entity (if any) they are looking at. If you don't care to _distinguish_ actual from made up names, this should work well enough for you. — alexis, Apr 30 '17 at 21:06

score 0 · Answer 1 · answered Apr 27 '17 at 15:37

I ran into the same problem when processing Russian folktales; it turns out that most of their names don't appear in Western gazetteers. A quick approach may be to use part-of-speech tags and keep only the tokens tagged NNP (proper nouns). See: http://www.nltk.org/book/ch05.html
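A minimal sketch of the NNP filter, assuming Penn Treebank tags as produced by `nltk.pos_tag`. The helper names `proper_nouns` and `tag_sentence` are made up for illustration, and including NNPS (plural proper nouns) is my own addition. The tags in `example` are written by hand to show the filter, not real tagger output. As the comment on the question notes, taggers lean on capitalization, so lowercase invented names may well not come back as NNP.

```python
def proper_nouns(tagged_tokens, tags=("NNP", "NNPS")):
    """Return the tokens whose part-of-speech tag marks a proper noun."""
    return [token for token, tag in tagged_tokens if tag in tags]


def tag_sentence(sentence):
    """Tokenize and tag a sentence with NLTK.

    Needs NLTK installed and its tokenizer and tagger data fetched
    beforehand with nltk.download().
    """
    import nltk  # imported here so the filter above works without NLTK

    return nltk.pos_tag(nltk.word_tokenize(sentence))


# Hand-written tags for illustration only (assumption, not tagger output).
example = [
    ("His", "PRP$"), ("name", "NN"), ("is", "VBZ"), ("Wuzzywugg", "NNP"),
    ("and", "CC"), ("he", "PRP"), ("has", "VBZ"), ("a", "DT"),
    ("dog", "NN"), ("named", "VBN"), ("Fizzbuzz", "NNP"),
]
print(proper_nouns(example))
```

With real text, `proper_nouns(tag_sentence(text))` chains the two steps; it is worth checking the tagger's output on a few of your own sentences before relying on it.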

This didn't entirely work for me. My approach involved extracting all noun phrases (NP nodes from the parse tree) and then extracting feature vectors, which I annotated myself, to build an ML classifier. You can find more information here: http://ieeexplore.ieee.org/document/7489041/
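The answer doesn't say which features or classifier were used, so the following is only a simplified sketch of the general idea. It works on single tokens rather than full NP nodes, the feature set in `context_features` is hypothetical, and `nltk.NaiveBayesClassifier` is swapped in as one readily available classifier. The training labels would have to come from your own annotation.

```python
def context_features(tokens, i):
    """Build a feature dict for tokens[i] from the word and its neighbours.

    The features are illustrative assumptions, not those from the linked paper.
    """
    word = tokens[i]
    return {
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "prev2_word": tokens[i - 2].lower() if i > 1 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "is_capitalized": word[:1].isupper(),
        "suffix2": word[-2:].lower(),
    }


def train_name_classifier(labeled_sentences):
    """Train a Naive Bayes classifier on hand-annotated sentences.

    labeled_sentences: list of (tokens, labels) pairs, where labels[i] is
    True if tokens[i] is a character name. Requires NLTK to be installed.
    """
    import nltk  # imported here so the feature extractor works without NLTK

    train_set = [
        (context_features(tokens, i), label)
        for tokens, labels in labeled_sentences
        for i, label in enumerate(labels)
    ]
    return nltk.NaiveBayesClassifier.train(train_set)


tokens = "His name is wuzzywugg and he has a dog named fizzbuzz".split()
print(context_features(tokens, 3))
```

Context features such as a preceding "name is" or "named" let a classifier pick up invented names without depending on capitalization, though it needs enough annotated sentences to learn those cues. A trained model is queried with `classifier.classify(context_features(tokens, i))`.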
