I've been playing with the Google Natural Language API, and in particular I used its location recognition to extract locations from HN's "Who Is Hiring" page. If I pass it a text like
"Blockai | San Francisco, CA | CV/ML and Front-end Engineers - https://blockai.com"
(from https://news.ycombinator.com/item?id=12631335)
Then the NL API returns the following entities:
The problem is that "ML" and "CV" are recognized as locations, but they actually stand for "Machine Learning" and "Computer Vision" respectively. I guess the algorithm concludes that CV/ML are locations because they're close to other locations (San Francisco, CA) in the text.
How can I recognize such "fake" locations in the API's output? I thought that the "salience" field might help, but I am not sure what rule of thumb would be suitable. I have even found that the API sometimes responds with salience values greater than 1, even though the docs say that these values are "in the [0, 1.0] range", e.g.:
{
    "name":"San Francisco",
    "type":"LOCATION",
    "metadata":{
        "wikipedia_url":"http://en.wikipedia.org/wiki/San_Francisco"
    },
    "salience":1.4515763148665428,
    "mentions":[ ]
},
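To make the salience idea concrete, here is a minimal sketch of the kind of filter I have in mind. It assumes the response has already been parsed into a list of dicts shaped like the entity above. The function name `split_locations` and the `min_salience` threshold of 0.1 are hypothetical, and every salience value in the sample except San Francisco's is made up for illustration; choosing a sensible threshold is exactly the part I don't know how to do.

```python
def split_locations(entities, min_salience=0.1):
    """Sort LOCATION entities into kept, suspicious, and out-of-range names.

    min_salience is an arbitrary, hypothetical cutoff, not a documented value.
    """
    kept, suspicious, out_of_range = [], [], []
    for entity in entities:
        if entity.get("type") != "LOCATION":
            continue  # only locations are of interest here
        salience = entity.get("salience", 0.0)
        if not 0.0 <= salience <= 1.0:
            # The docs promise [0, 1.0]; anything else cannot be trusted as-is
            out_of_range.append(entity["name"])
        elif salience < min_salience:
            suspicious.append(entity["name"])
        else:
            kept.append(entity["name"])
    return kept, suspicious, out_of_range


# Sample input: only the San Francisco salience comes from a real response;
# the other entities and values are invented for this sketch.
sample_entities = [
    {"name": "San Francisco", "type": "LOCATION", "salience": 1.4515763148665428},
    {"name": "CA", "type": "LOCATION", "salience": 0.3},
    {"name": "ML", "type": "LOCATION", "salience": 0.05},
    {"name": "Blockai", "type": "ORGANIZATION", "salience": 0.4},
]

kept, suspicious, out_of_range = split_locations(sample_entities)
print("kept:", kept)
print("suspicious:", suspicious)
print("out of range:", out_of_range)
```

With a fixed cutoff like this, a real location could just as easily land in the suspicious bucket, which is why I doubt salience alone is enough.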
Any help is highly appreciated!