How to correlate similar messages using NLP

Question

I have a couple of tweets that need to be processed. I am trying to find messages that imply harm to a person. How do I go about achieving this with NLP?

I bought my son a toy gun
I shot my neighbor with a gun
I don't like this gun
I would love to own this gun
This gun is a very good buy
Feel like shooting myself with a gun

In the sentences above, the 2nd and 6th are the ones I would like to find.

There is a *lot* of research in this area. It would probably be a good idea to start reading some papers or book chapters on classification and semantic processing. — Hunter McMillen, Jun 26 '13 at 13:02
Lescai's got it. Don't you even worry about it. Just let the NSA handle it. — G. Blake Meike, Jul 04 '13 at 00:04

score 1 · Answer 1 · answered Jun 27 '13 at 08:49

If the problem is restricted only to guns and shooting, then you could use a dependency parser (like the Stanford Parser) to find verbs and their (prepositional) objects, starting with the verb and tracing its dependants in the parse tree. For example, in both 2 and 6 these would be "shoot, with, gun".

Then you can use a list of (near) synonyms for "shoot" ("kill", "murder", "wound", etc) and "gun" ("weapon", "rifle", etc) to check if they occur in this pattern (verb - preposition - noun) in each sentence.

There will be other ways to express the same idea, e.g. "I bought a gun to shoot my neighbor", where the dependency relation is different, and you'd need to detect these types of dependencies too.
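
As a rough sketch of the pattern check described above, the snippet below assumes the parser output has already been reduced to (head, relation, dependent) triples of lemmas. The triple format, the relation names "prep" and "pobj", and the function name has_harm_pattern are illustrative assumptions, not the output format of any particular parser. Keying on lemmas instead of token positions is a simplification that breaks when a word occurs twice in a sentence.

```python
# Illustrative word lists taken from the synonyms mentioned above.
HARM_VERBS = {"shoot", "kill", "murder", "wound"}
WEAPON_NOUNS = {"gun", "weapon", "rifle"}

def has_harm_pattern(triples):
    """Return True if a harm verb governs a preposition whose object is a weapon noun.

    triples: list of (head, relation, dependent) lemma tuples (hypothetical format).
    """
    # Map each head word to its (relation, dependent) pairs.
    children = {}
    for head, rel, dep in triples:
        children.setdefault(head, []).append((rel, dep))
    # Trace verb -> preposition -> noun.
    for verb in HARM_VERBS:
        for rel, prep in children.get(verb, []):
            if rel != "prep":
                continue
            for rel2, noun in children.get(prep, []):
                if rel2 == "pobj" and noun in WEAPON_NOUNS:
                    return True
    return False

# Hand-written triples for sentences 2 and 1 of the question.
sentence_2 = [("shoot", "nsubj", "I"), ("shoot", "dobj", "neighbor"),
              ("shoot", "prep", "with"), ("with", "pobj", "gun")]
sentence_1 = [("buy", "nsubj", "I"), ("buy", "iobj", "son"), ("buy", "dobj", "gun")]

print(has_harm_pattern(sentence_2))  # True
print(has_harm_pattern(sentence_1))  # False
```

A sentence like "I bought a gun to shoot my neighbor" would not match this pattern, so each additional dependency shape needs its own check.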

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

All of vpekar's suggestions are good. Here is some Python code that will at least parse the sentences and check whether they contain verbs from a user-defined set of harm words. Note: most 'harm words' probably have multiple senses, many of which could have nothing to do with harm. This approach does not attempt to disambiguate word sense.

(This code assumes you have NLTK and Stanford CoreNLP)

import os
import subprocess
from xml.dom import minidom
from nltk.corpus import wordnet as wn

def StanfordCoreNLP_Plain(inFile):
    #Create the startup info so the java program runs in the background (for windows computers)
    startupinfo = None
    if os.name == 'nt':
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    #Execute the stanford parser from the command line
    #Join the classpath with os.pathsep (';' on Windows, ':' elsewhere)
    classpath = os.pathsep.join(['stanford-corenlp-1.3.5.jar', 'stanford-corenlp-1.3.5-models.jar', 'xom.jar', 'joda-time.jar'])
    cmd = ['java', '-Xmx1g', '-cp', classpath, 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit,pos', '-file', inFile]
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, startupinfo=startupinfo).communicate()
    #The XML output is expected at <input file name>.xml in the working directory
    outFile = os.path.basename(inFile) + '.xml'
    xmldoc = minidom.parse(outFile)
    itemlist = xmldoc.getElementsByTagName('sentence')
    Document = []
    #Get the data out of the xml document and into python lists
    for item in itemlist:
        SentNum = item.getAttribute('id')
        sentList = []
        tokens = item.getElementsByTagName('token')
        for d in tokens:
            word = d.getElementsByTagName('word')[0].firstChild.data
            pos = d.getElementsByTagName('POS')[0].firstChild.data
            sentList.append([str(pos.strip()), str(word.strip())])
        Document.append(sentList)
    return Document

def FindHarmSentence(Document):
    #Loop through sentences in the document.  Look for verbs in the Harm Words Set.
    VerbTags = ['VBN', 'VB', 'VBZ', 'VBD', 'VBG', 'VBP', 'V']
    HarmWords = ("shoot", "kill")
    ReturnSentences = []
    for Sentence in Document:
        for word in Sentence:
            if word[0] in VerbTags:
                #wn.morphy returns None when it finds no verb root
                wordRoot = wn.morphy(word[1], wn.VERB)
                if wordRoot in HarmWords:
                    print("This message could indicate harm: " + str(Sentence))
                    ReturnSentences.append(Sentence)
    return ReturnSentences

#Assuming your input is a string, we need to put the strings in some file.
Sentences = "I bought my son a toy gun. I shot my neighbor with a gun. I don't like this gun. I would love to own this gun. This gun is a very good buy. Feel like shooting myself with a gun."
ProcessFile = "ProcFile.txt"
with open(ProcessFile, 'w') as OpenProcessFile:
    OpenProcessFile.write(Sentences)

#Sentence split, tokenize, and part of speech tag the data using Stanford Core NLP
Document = StanfordCoreNLP_Plain(ProcessFile)

#Find sentences in the document with harm words
HarmSentences = FindHarmSentence(Document)

This outputs the following:

This message could indicate harm: [['PRP', 'I'], ['VBD', 'shot'], ['PRP$', 'my'], ['NN', 'neighbor'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]

This message could indicate harm: [['NNP', 'Feel'], ['IN', 'like'], ['VBG', 'shooting'], ['PRP', 'myself'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]
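
Because the verb check alone would also flag something like "I shot a photo", one crude way to narrow it is to additionally require a weapon noun in the same sentence. The sketch below works on the [POS, word] lists shown above. The small VERB_LEMMAS table is a hypothetical stand-in for wn.morphy so that the snippet runs without NLTK data, and the word lists, the function name and the hand-tagged "photo" sentence are assumptions for illustration only. This is keyword co-occurrence, not real word sense disambiguation.

```python
# Hypothetical lemma table standing in for wn.morphy(word, wn.VERB).
VERB_LEMMAS = {"shot": "shoot", "shooting": "shoot", "shoots": "shoot", "shoot": "shoot",
               "killed": "kill", "killing": "kill", "kills": "kill", "kill": "kill"}
HARM_VERBS = {"shoot", "kill"}
WEAPON_NOUNS = {"gun", "weapon", "rifle"}  # illustrative list, singular forms only

def mentions_harm_with_weapon(sentence):
    """sentence: list of [POS, word] pairs, as returned by StanfordCoreNLP_Plain."""
    # A verb tag (VB, VBD, VBG, ...) whose lemma is a harm verb.
    has_harm_verb = any(pos.startswith("VB") and VERB_LEMMAS.get(word.lower()) in HARM_VERBS
                        for pos, word in sentence)
    # A noun tag (NN, NNS, ...) whose surface form is in the weapon list.
    has_weapon = any(pos.startswith("NN") and word.lower() in WEAPON_NOUNS
                     for pos, word in sentence)
    return has_harm_verb and has_weapon

flagged = [['PRP', 'I'], ['VBD', 'shot'], ['PRP$', 'my'], ['NN', 'neighbor'],
           ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]
# Hand-tagged example, not actual tagger output.
photo = [['PRP', 'I'], ['VBD', 'shot'], ['DT', 'a'], ['NN', 'photo'], ['.', '.']]

print(mentions_harm_with_weapon(flagged))  # True
print(mentions_harm_with_weapon(photo))    # False
```

The trade-off is lower recall: harmful messages that name no weapon, or use a plural such as "guns", are missed by this filter.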

score 0 · Answer 3 · answered Jul 06 '13 at 02:43

I would have a look at SenticNet

http://sentic.net/sentics

It provides an open source knowledge base and parser that assigns emotional value to text fragments. Using the library, you could train it to recognize statements that you're interested in.
