
Is there any NLP Python library that splits sentences or joins words into related pairs of words? For example:

That is not bad example -> "That" "is" "not bad" "example"

"Not bad" means the same as good so it would be useless to process it as "not" and "bad" in machine learning. I dont even know how to call these pairs of words that are correlated. (term extraction? phases extraction?) Or would be even better to split into adjectives with nouns for example:

dishonest media relating about tax cuts -> "dishonest media", "relating", "about", "tax cuts"

I found topia.termextract, but it does not work with Python 3.

Ala Głowacka

3 Answers


Check out the spaCy library.

It doesn't have that functionality out of the box; you need to build the rules yourself. But the rules are very human-readable, and there are many options you can feed in (POS tags, regex, lemmas, or any combination of those).

Of particular note are the sections on the PhraseMatcher class.

Directly copied from the documentation is a code sample:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)

# match against a fixed list of known phrases
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
patterns = [nlp(text) for text in terminology_list]
matcher.add('TerminologyList', None, *patterns)

doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
          u"converse in the Oval Office inside the White House in Washington, D.C.")

# each match is a (match_id, start, end) tuple of token indices
matches = matcher(doc)
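
Since the question asks about pairing arbitrary adjectives with nouns rather than matching a fixed terminology list, the rule-based Matcher can do that with POS-tag patterns. A minimal sketch, not taken from the docs; the 'PAIR' rule name and the two patterns are my own illustration, using the spaCy 2.x API:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# hypothetical rules: adjective+noun ('dishonest media') and noun+noun ('tax cuts')
matcher.add('PAIR', None,
            [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
            [{'POS': 'NOUN'}, {'POS': 'NOUN'}])

doc = nlp(u'dishonest media relating about tax cuts')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
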
matisetorm
  • If it's just a list of pre-known gazetteer terms, NLTK does that too: https://stackoverflow.com/a/47664608/610569 =) – alvas Feb 22 '18 at 01:21
  • Totally. But the customization and ease of use have been rather impressive to me. I've used both, but Spacy has some other features that have made me start using it (outside the scope of this question). – matisetorm Feb 22 '18 at 01:23
  • That is not what I want. I want to match any nouns, not only predefined ones. – Ala Głowacka Feb 22 '18 at 10:39
  • Read the rest of the documentation. It does allow you to automatically pair words based on POS and word dependencies. – matisetorm Feb 22 '18 at 11:34
  • Specifically, Spacy says `The rule matcher (class) also lets you pass in a custom callback to act on matches – for example, to merge entities and apply custom labels.` Those custom labels would be your new tokenization scheme. You could also leverage the dependency parsing section of the documentation (https://spacy.io/usage/linguistic-features#section-dependency-parse). The tools are all there, but you aren't going to find a plug-and-play solution for what you are asking for; whether you would want to group "tax cuts" and/or "dishonest media" is highly use-case dependent. – matisetorm Feb 22 '18 at 11:59
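
To illustrate the dependency-parse route mentioned in the comments above: spaCy's doc.noun_chunks already uses the parse to group each noun with its modifiers. A minimal sketch:

import spacy

nlp = spacy.load('en')
doc = nlp(u'dishonest media relating about tax cuts')

# each chunk is a noun plus the words that modify it
for chunk in doc.noun_chunks:
    print(chunk.text)  # e.g. 'dishonest media', 'tax cuts'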

To automatically detect common phrases in a stream of sentences, I recommend checking out Gensim's phrase (collocation) detection.

A good example of how it works:

bigram = Phraser(phrases)
sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
print(bigram[sent])

Output: [u'the', u'mayor', u'of', u'new_york', u'was', u'there']
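
For a runnable end-to-end version, the Phraser first has to be trained on a corpus. Here is a minimal sketch, assuming a toy corpus of tokenized sentences (a real corpus would be much larger, with higher min_count and threshold values):

from gensim.models.phrases import Phrases, Phraser

# toy training corpus of tokenized sentences (hypothetical)
sentences = [
    [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'],
    [u'new', u'york', u'is', u'a', u'big', u'city'],
] * 5

# learn which adjacent word pairs co-occur often enough to merge into one token
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)

print(bigram[[u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']])
# expected: [u'the', u'mayor', u'of', u'new_york', u'was', u'there']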

rudolfe
  • OP wants to pair up not noun chunks (which gensim does quite well), but dependency parsing of words closely linked together. There are methods to do this in both NLTK and some other tokenization libraries, but OP seems to want something out of the box. – matisetorm Feb 22 '18 at 11:51

It depends on the ML model you use and how it has been trained. The standard libraries would be NLTK or TextBlob.

TextBlob, I believe, should already be trained for these language nuances:

import re
from textblob import TextBlob

# strip @mentions, URLs, and any non-alphanumeric characters, then collapse whitespace
x = ' '.join(re.sub(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', ' ',
                    'That is not bad example').split())

analysis = TextBlob(x)
sentiment = analysis.sentiment.polarity  # float in [-1.0, 1.0]

The above code should yield the following sentiments:

'That is bad example' : -0.6999999999999998    
'That is not bad example' : 0.3499999999999999
'That is good example' : 0.7
'That is not good example' : -0.35
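
These values can be reproduced with a short loop (exact polarity scores may vary slightly between TextBlob versions):

for s in ['That is bad example', 'That is not bad example',
          'That is good example', 'That is not good example']:
    print(s, ':', TextBlob(s).sentiment.polarity)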

Already you can see that this sentiment analyzer has grasped some concept of double negatives and negated positives. It can be trained further by invoking:

from textblob.classifiers import NaiveBayesClassifier

# training_set is a list of (text, label) pairs
cl = NaiveBayesClassifier(training_set)
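
Continuing the snippet above, a minimal sketch of what that training data might look like (the examples and labels here are hypothetical):

# hypothetical labelled examples: (text, label) pairs
training_set = [
    ('That is good example', 'pos'),
    ('That is not bad example', 'pos'),
    ('That is bad example', 'neg'),
    ('That is not good example', 'neg'),
]
cl = NaiveBayesClassifier(training_set)
print(cl.classify('not a bad result'))  # e.g. 'pos'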

and using cl. But effort would be better spent defining what counts as a positive sentiment by some arbitrary threshold (if polarity > 0.1, then good). I mean, 'bad' and 'not good' are already negative... so why try to re-invent the wheel?
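
That thresholding idea is straightforward; a minimal sketch (the function name and the 0.1 cutoff are just the arbitrary choices suggested above):

from textblob import TextBlob

def label(text, threshold=0.1):
    # map raw polarity onto a coarse good/bad label
    polarity = TextBlob(text).sentiment.polarity
    return 'good' if polarity > threshold else 'bad'

print(label('That is not bad example'))  # 'good'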

ML right now is only "smart enough"... you often have to use your own intelligence to bridge the gaps the machine leaves...

You can also use sentiment = analysis.sentiment.subjectivity to see how objective or subjective the text is, which can offer more insight.

RealRageDontQuit