6

I admit that I haven't searched the SO database extensively. I tried reading the docs for the natural npm package, but it doesn't seem to provide this feature. I would like to know whether the requirement below is possible.

I have a database that lists all the cities of a country. I also have ratings for these cities (best place to live, worst place to live, best rated city, worst rated city, etc.). From the user interface, I would like to let the user enter free text and use that text to search my database.

For example: "Best place to live in California", "places near California", or "places in California".

From the sentences above, I want to extract only the nouns (maybe), as these will be the name of the city or country that I can search for.

Then extracting 'best' tells me that I can sort in a particular order, and so on.

Any suggestions or directions to look for?

I realize the question risks being marked as 'debatable', but I posted it to get some direction on how to proceed.

Vaya

3 Answers

9

[I came across this question whilst looking for some use cases to test a module I'm working on. Obviously the question is a little old, but since my module addresses the question I thought I might as well add some information here for future searchers.]

You should be able to do what you want with a POS chunker. I've recently released one for Node that is modelled on the chunkers provided by the NLTK (Python) and Stanford NLP (Java) libraries (the chunk() and TokensRegex() methods, respectively).

The module processes strings that already contain parts-of-speech, so first you'll need to run your text through a parts-of-speech tagger, such as pos:

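var pos = require('pos');

// Tokenize the sentence; the trailing period becomes the final ./. token.
var words = new pos.Lexer().lex('Best place to live in California.');
// Tag each token, then join the [word, tag] pairs into one "word/TAG" string.
var tags = new pos.Tagger()
  .tag(words)
  .map(function(tag){return tag[0] + '/' + tag[1];})
  .join(' ');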

This will give you:

Best/JJS place/NN to/TO live/VB in/IN California/NNP ./.

Now you can use pos-chunker to find all proper nouns:

var chunker = require('pos-chunker');

var places = chunker.chunk(tags, '[{ tag: NNP }]');

This will give you:

Best/JJS place/NN to/TO live/VB in/IN {California/NNP} ./.
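To turn a chunked string like the one above into plain place names for a database lookup, something along these lines could work. This is only a sketch: `extractChunks` is a hypothetical helper, not part of pos-chunker, and it assumes the brace-delimited `word/TAG` output format shown in this answer.

```javascript
// Hypothetical helper (not part of pos-chunker): collect the words inside
// every {word/TAG ...} chunk of a chunked string.
function extractChunks(chunked) {
  var results = [];
  var pattern = /\{([^}]+)\}/g;
  var match;
  while ((match = pattern.exec(chunked)) !== null) {
    // A chunk may hold several word/TAG tokens; keep only the word part.
    var words = match[1].trim().split(/\s+/).map(function (token) {
      var slash = token.lastIndexOf('/');
      return slash === -1 ? token : token.slice(0, slash);
    });
    results.push(words.join(' '));
  }
  return results;
}

var chunked = 'Best/JJS place/NN to/TO live/VB in/IN {California/NNP} ./.';
console.log(extractChunks(chunked)); // [ 'California' ]
```

The resulting names can then be passed to whatever query you run against the cities table.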

Similarly you could extract verbs to understand what people want to do ('live', 'swim', 'eat', etc.):

var verbs = chunker.chunk(tags, '[{ tag: VB }]');

Which would yield:

Best/JJS place/NN to/TO {live/VB} in/IN California/NNP ./.

You can also match words, sequences of words and tags, use lookahead, group sequences together to create chunks (and then match on those), and other such things.
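The same tagged string can also drive the sort order the question asks about. The sketch below assumes the tagger labels superlatives such as "Best" as JJS, as in the example output above; the `sortBySuperlative` mapping and the `sortDirection` helper are hypothetical names, and the mapping would need to be extended for your own ratings.

```javascript
// Hypothetical mapping from superlative adjectives to a sort direction
// on the rating column.
var sortBySuperlative = { best: 'DESC', worst: 'ASC' };

// Scan a "word/TAG word/TAG ..." string for the first known superlative (JJS).
function sortDirection(tagged) {
  var tokens = tagged.split(/\s+/);
  for (var i = 0; i < tokens.length; i++) {
    var slash = tokens[i].lastIndexOf('/');
    if (slash === -1) continue; // not a word/TAG token
    var word = tokens[i].slice(0, slash).toLowerCase();
    var tag = tokens[i].slice(slash + 1);
    if (tag === 'JJS' &&
        Object.prototype.hasOwnProperty.call(sortBySuperlative, word)) {
      return sortBySuperlative[word];
    }
  }
  return null; // no recognised superlative: leave the order unspecified
}

console.log(sortDirection('Best/JJS place/NN to/TO live/VB in/IN California/NNP ./.'));
// DESC
```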

Mark Birbeck
  • Does this only work with English, or does it support other languages (French, Spanish, ...)? – aidonsnous Apr 11 '18 at 05:28
  • You would need to get a parts-of-speech tagger for the language you want. Once you have parsed your sentences and got the parts of speech, you can pass the result through `pos-chunker`. I'm afraid I don't know whether there are equivalents of the `pos` module for other languages. – Mark Birbeck Apr 12 '18 at 10:00
1

You probably don't have to identify what is a noun. Since you already have a list of city and country names that your system can handle, you just have to check whether the user input contains one of these names.
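A minimal sketch of that check, assuming the names have been loaded from the database into an array (`knownPlaces` below is a hypothetical stand-in, and `findPlaces` is an illustrative helper):

```javascript
// Hypothetical list; in practice this would be loaded from the cities database.
var knownPlaces = ['California', 'San Francisco', 'New York'];

// Return every known place whose name appears in the user's free text,
// ignoring case.
function findPlaces(input, places) {
  var text = input.toLowerCase();
  return places.filter(function (place) {
    return text.indexOf(place.toLowerCase()) !== -1;
  });
}

console.log(findPlaces('Best place to live in California', knownPlaces));
// [ 'California' ]
```

This is a plain substring match, so a short name can match inside a longer word; matching on word boundaries would be stricter.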

Thomas
  • Actually, this was the first thing I tried, but I'm trying to work with geolocation. So SFO doesn't need to be present in my database, but I can have the geocodes of a few places in California. – Vaya Jan 24 '14 at 16:07
  • 1
    @Vaya I don't understand exactly what you mean, but it's clear that in that case your question doesn't really describe what you're after. My answer addresses what you actually asked. – Thomas Jan 24 '14 at 17:03
0

First, you'll need a way to identify nouns. No core Node module does this for you. You need to loop through all the words in the string and compare each one against some kind of dictionary database to check whether it's a noun.
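As a rough sketch of that loop, with a tiny in-memory object standing in for the dictionary database (the `dictionary` contents and the `findNouns` helper are made up for illustration):

```javascript
// Hypothetical in-memory dictionary; a real one would be a database or an API.
var dictionary = {
  best: 'adj',
  place: 'noun',
  to: 'prep',
  live: 'verb',
  'in': 'prep',
  california: 'noun'
};

// Split the sentence into words, strip punctuation, and keep the nouns.
function findNouns(sentence, dict) {
  return sentence
    .split(/\s+/)
    .map(function (word) { return word.replace(/[^A-Za-z]/g, ''); })
    .filter(function (word) {
      return word !== '' && dict[word.toLowerCase()] === 'noun';
    });
}

console.log(findNouns('Best place to live in California', dictionary));
// [ 'place', 'California' ]
```

Note that common nouns such as "place" come back alongside the city name, so the result still needs filtering against your own data.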

I found this API, which looks pretty promising. You query it for a word and it sends back a blob of data like this:

<?xml version="1.0" encoding="UTF-8"?>
<results>
    <result>
        <term>consistent, uniform</term>
        <definition>the same throughout in structure or composition</definition>
        <partofspeech>adj</partofspeech>
        <example>bituminous coal is often treated as a consistent and homogeneous product</example>
    </result>
</results>

You can see that it includes a partofspeech element, which tells you that the word "consistent" is an adjective.
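Pulling that value out of a response could look roughly like this. It is a sketch only: it assumes the response has the shape shown above, `partsOfSpeech` is an illustrative helper, and a regular expression is used purely to keep the example dependency-free. A proper XML parser would be more robust.

```javascript
// Collect the text of every <partofspeech> element in an XML response string.
function partsOfSpeech(xml) {
  var found = [];
  var pattern = /<partofspeech>([^<]*)<\/partofspeech>/g;
  var match;
  while ((match = pattern.exec(xml)) !== null) {
    found.push(match[1].trim());
  }
  return found;
}

// Shortened version of the response shown above.
var response =
  '<results><result><term>consistent, uniform</term>' +
  '<partofspeech>adj</partofspeech></result></results>';

console.log(partsOfSpeech(response)); // [ 'adj' ]
```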


Another (and better) option if you have control over the text being stored is to use some kind of markup language to identify important parts of the string before you save it. Something like BBCode. I even found a BBCode node module that will help you do this.

Then you can save your strings to the database like this:

Best place to live in [city]California[/city] or places near [city]California[/city] or places in [city]California[/city].

or

My name is [first]Alex[/first] [last]Ford[/last].
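Reading the marked-up pieces back out of a stored string can then be as simple as the sketch below. It does not use any particular BBCode module; `extractFields` is a hypothetical helper that assumes simple, non-nested `[tag]value[/tag]` pairs like the ones above.

```javascript
// Find every [tag]value[/tag] pair in a string (no nesting, no attributes).
function extractFields(text) {
  var fields = [];
  // \1 requires the closing tag to have the same name as the opening tag.
  var pattern = /\[(\w+)\]([\s\S]*?)\[\/\1\]/g;
  var match;
  while ((match = pattern.exec(text)) !== null) {
    fields.push({ tag: match[1], value: match[2] });
  }
  return fields;
}

var fields = extractFields('My name is [first]Alex[/first] [last]Ford[/last].');
fields.forEach(function (field) {
  console.log(field.tag + ' = ' + field.value);
});
```

For the second example string this yields two fields, first = Alex and last = Ford.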

If you're letting users type whole sentences and then trying to figure out which parts of those sentences are data your app should use, you're making things unnecessarily hard on yourself. You should either ask them to enter the important pieces of data in their own text boxes, or give them a formatting language such as the aforementioned BBCode syntax so they can identify the important bits for you. Finding out which parts of a string are important is going to be a huge job, I think.

CatDadCode
  • I don't want to control the text that the user can type in. For that matter, I already have screens with specific dropdowns, textboxes and search buttons. I'm trying to interpret what the user wants (within my domain) and then route to the correct datastore to do the search. – Vaya Jan 24 '14 at 16:09
  • Then good luck. As I said already, you'll need a dictionary database complete with speech usage metadata. I don't see any other way. – CatDadCode Jan 24 '14 at 18:31