I've been playing with the Google Natural Language API and, in particular, used its entity recognition to extract locations from HN's "Who Is Hiring" page. If I pass text like

Blockai | San Francisco, CA | CV/ML and Front-end Engineers - https://blockai.com

(from https://news.ycombinator.com/item?id=12631335)
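
For reference, here is roughly how I'm calling the API (a minimal sketch assuming the google-cloud-language Python client; method and field names may vary by client version):

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

text = "Blockai | San Francisco, CA | CV/ML and Front-end Engineers - https://blockai.com"
document = language_v1.Document(
    content=text, type_=language_v1.Document.Type.PLAIN_TEXT
)

response = client.analyze_entities(request={"document": document})

for entity in response.entities:
    # Each entity has a name, a type (PERSON, LOCATION, ...) and a salience score
    print(entity.name, entity.type_.name, entity.salience)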

Then the NL API returns a list of entities in which "San Francisco", "CA", "CV", and "ML" all come back tagged as locations (screenshot omitted).

The problem is that "ML" and "CV" are recognized as locations, when they actually stand for "Machine Learning" and "Computer Vision" respectively. I guess the algorithm concludes that CV/ML are locations because they appear close to real locations (San Francisco, CA) in the text.

I was wondering how I can recognize such "fake" locations in the API's output. I thought that the "salience" parameter might help, but I'm not sure what rule of thumb would be suitable. I've even found that the API sometimes responds with salience values greater than 1, even though the docs say these values are "in the [0, 1.0] range", e.g.:

{
  "name": "San Francisco",
  "type": "LOCATION",
  "metadata": {
    "wikipedia_url": "http://en.wikipedia.org/wiki/San_Francisco"
  },
  "salience": 1.4515763148665428,
  "mentions": []
},
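
For what it's worth, the kind of salience filter I had in mind looks like this (the threshold is hypothetical; I don't know what a sensible cutoff would be, which is exactly my question):

MIN_SALIENCE = 0.05  # hypothetical cutoff, picked arbitrarily

locations = [
    entity for entity in response.entities
    if entity.type_ == language_v1.Entity.Type.LOCATION
    and entity.salience >= MIN_SALIENCE
]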

Any help is highly appreciated!

1 Answer


Sometimes it's very tricky for the underlying algorithms to disambiguate entities, especially when there is not enough context. Salience does not help here, because salience measures how central an entity is to the text, regardless of its type. In this particular case, you could use the provided metadata (e.g. the Wikipedia URL) to further assess whether the entity is indeed a location.
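
For example, a rough post-filter along these lines (a sketch building on the client code from the question, treating the presence of a Wikipedia URL as a heuristic signal, not a guarantee):

def looks_like_real_location(entity):
    # Heuristic: entities that resolve to a Wikipedia page are more likely
    # to be genuine locations than bare abbreviations like "CV" or "ML".
    return "wikipedia_url" in entity.metadata

locations = [
    entity for entity in response.entities
    if entity.type_ == language_v1.Entity.Type.LOCATION
    and looks_like_real_location(entity)
]

If you need more confidence, you could additionally fetch the linked Wikipedia page and check that it actually describes a place.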