11

Is there some way to estimate how likely a word is to be a person's name?

So if I see the word "understanding" I would get a probability of 0.01, whereas the word "Johnson" would return a probability of 0.99, while a word like Smith would return 0.75 and a word like Apple 0.15.

Is there any way to do this?

The goal is that if someone searches for, say, `Charles Darwin galapagos`, the search engine guesses that it should search the author field for Charles and Darwin and the title and abstract fields for galapagos.

Jordan Reiter
  • Would checking the name against a huge list of known names work? – Shahbaz Sep 05 '12 at 22:32
  • Well, one way (which does not work in all cases) would be to check whether the word is actually in a dictionary, because most of the time a name doesn't have a meaning `(your Charles Darwin)`. If it isn't in the dictionary, you can conclude that it's a name. If it is, then I'm not sure how to proceed. – noMAD Sep 05 '12 at 22:37
  • @noMAD: This approach will fail to identify names of places (galapagos) and will say they are names of people. – amit Sep 05 '12 at 22:38
  • @amit: Technically, `galapagos` could be the name of a person, right? – noMAD Sep 05 '12 at 22:40
  • Names start with capital letters! – Kirk Broadhurst Sep 05 '12 at 22:55
  • @KirkBroadhurst - would that mean `Charles` is recognized as a name but `charles` is not? – Krease Sep 05 '12 at 22:56
  • @KirkBroadhurst: Capitalization is a terrible thing to rely on when dealing with search queries. Most users do not capitalize correctly in their search queries. Think about yourself: do you search Google for Edsger Dijkstra or edsger dijkstra? (If the former, I can assure you that you are in the minority.) – amit Sep 05 '12 at 23:01
  • My comment was obtuse, but my point was that identifying names is unreliable and unhelpful. If the user searches for a book called 'Charles Darwin in the galapagos' and they don't know the author, and you search for books with an author like `Charles` or `Darwin`, what happens? I think the underlying approach here is spurious and this kind of 'optimisation' can hurt more than help. – Kirk Broadhurst Sep 06 '12 at 03:27

3 Answers

8

A related task in natural language processing is known as Named Entity Recognition and deals with names of people, organizations, locations, etc.

Most models designed to solve this problem are statistical in nature and use both context and prior knowledge in their predictions. There are a number of open source implementations one can use, e.g. the Stanford NER (see its online demo).

Qnan
8

My quick hack would be this:

Get the list of names from the census bureau, ordered by popularity; it's freely available. Give each name a normalized popularity score (1.0 = most popular, 0.0 = least).

Then get an open-source dictionary and do some research to pull together a frequency score for every word. You can find a frequency list at Wiktionary. Assign every word a popularity score from 1.0 down to 0.0. The convenient thing is that if you can't find a word on the frequency list, you get to assume it's a pretty uncommon word.

Look for a word on both lists. If it's on just one or the other, you're done. If it's on both, use a formula to compute a weighted probability, something like `(Name Popularity) / (Name Popularity + Other Popularity)`. If it's not on either list, it's probably a name.
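
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two ranked lists are tiny made-up stand-ins for the real census and word-frequency data, the function names are hypothetical, and the `unknown_score` default of 0.9 for words on neither list is an arbitrary guess at "probably a name".

```python
def normalize_ranks(ranked):
    """Map a most-popular-first list to scores from 1.0 (first) to 0.0 (last)."""
    n = len(ranked)
    if n == 1:
        return {ranked[0].lower(): 1.0}
    return {w.lower(): 1.0 - i / (n - 1) for i, w in enumerate(ranked)}


def name_probability(word, name_scores, word_scores, unknown_score=0.9):
    """Rough chance that `word` is a person's name, from two popularity tables."""
    key = word.lower()  # search queries are often not capitalized
    in_names = key in name_scores
    in_words = key in word_scores
    if in_names and in_words:
        name_pop = name_scores[key]
        other_pop = word_scores[key]
        total = name_pop + other_pop
        # Both scores can be 0.0 (last on both lists); call that a coin flip.
        return name_pop / total if total > 0 else 0.5
    if in_names:
        return 1.0
    if in_words:
        return 0.0
    return unknown_score  # assumption: on neither list means "probably a name"


# Toy data, most popular first. Real lists would come from the census
# bureau and a word-frequency list; these entries are invented.
NAME_SCORES = normalize_ranks(["smith", "johnson", "charles", "apple", "zed"])
WORD_SCORES = normalize_ranks(["the", "apple", "smith", "understanding", "qua"])

if __name__ == "__main__":
    for w in ["Johnson", "Smith", "Apple", "understanding", "galapagos"]:
        print(w, round(name_probability(w, NAME_SCORES, WORD_SCORES), 2))
```

Note that with this scheme an unseen word such as "galapagos" gets the `unknown_score`, so the result depends heavily on how complete the two lists are.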

  • +1 - this is much more useful than the basic dictionary lookup comparison I was thinking of (and assumed was excluded) – Krease Sep 06 '12 at 02:35
  • Could be a name, could be a typo, could be a foreign word or a proper name but not of a person. – Qnan Sep 06 '12 at 10:45
  • Mind you, I'm not criticizing, just pointing out the fact that dictionary information and context information are complementary. Think about "约翰 came home late" versus "That's a 谴责的 huge thing, he said". Not having seen either word before, a human would still suppose that the first one denoted some person or other animate object, and most likely is the name of that object, while the second is less likely to do so. – Qnan Sep 06 '12 at 10:55
0

Based on just the word (or series of words that does not form a sentence), I'd say no, or at least not one that would be able to provide any more information than a "known words dictionary" lookup.

Different locales would have different probabilities as well, and it's very much the position of the word in a sentence and the other words that signal whether it's a name or some other noun/verb.

For example, "Word" might be a:

  1. noun - "The word on the page is blurry"
  2. verb - "I word my sentences carefully"
  3. adjective - "I like word games"
  4. proper name - "My friend Word is nice to me"

It all depends on context and position in a sentence, and the rules for this change from language to language. Also, new names get invented regularly: next year's most popular baby name may be "Galapagos" instead of "Liam".

Krease