I have a list of tuples that are generated from a string using NLTK's PoS tagger.

I'm trying to find the "intent" of a specific string in order to append it to a dataframe, so I need a way to generate a syntax/grammar rule.

import nltk

string = "RED WHITE AND BLUE"

string_list = nltk.pos_tag(string.split())

# string_list == [('RED', 'JJ'), ('WHITE', 'NNP'), ('AND', 'NNP'), ('BLUE', 'NNP')]

The strings vary in size, from 2-3 tokens all the way to full paragraphs (40-50+ tokens), so I'm wondering if there is a general form or rule that I can create in order to parse a sentence.

So if I want to find a pattern in a tagged list, an example in pseudocode would be:

string_pattern = "I want to kill all the bad guys in the Halo Game"

pattern = ('I', 'PRP') + ('want', 'VBP') + ('to', 'TO') + ('kill:', 'JJ') + ('all', 'DT') + ('bad', 'JJ') + ('guys', 'NNS') + ('in', 'IN') + ('Halo', 'NN') + ('Game', 'NN')

Ideally I would be able to match part of the pattern in a tagged string, so it finds:

('I', 'PRP') + ('want', 'VBP') + ('to', 'TO') + ('kill:', 'JJ')

but it doesn't need the rest. Conversely, it should be able to find multiple occurrences of the pattern in the same string, in case the string is a paragraph. If anyone knows the best way to do this, or knows a better alternative, it would be really helpful!

Sebastian Goslin

1 Answer

The simplest method I can think of is brute force (you could of course adapt it, or even use some machine learning to help find classes for easier matching).

A simple brute-force method follows:

Tag the String

string_list = nltk.pos_tag(string.split())

Create a list of expected tags

pos_tags = ["NN", "VBP", "NN"]

The following function checks whether this pattern appears as a contiguous run of tags:

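def find_match(string_list, pos_tags):
    # Returns (matched, match_start_pos) for the first contiguous run of tags
    # equal to pos_tags. match_start_pos is only meaningful when matched is True.
    # Caveat: a pattern that repeats its own prefix (e.g. NN NN VB) can be missed.
    num_matched = 0
    match_start_pos = 0
    matched = False
    # Enumerating gives you an index, so you can record where matching starts
    for idx, (word, tag) in enumerate(string_list):
        if tag != pos_tags[num_matched]:
            # Mismatch: start over, but let this token begin a new match below
            num_matched = 0
        if tag == pos_tags[num_matched]:
            if num_matched == 0:
                # Record the start before incrementing the counter
                match_start_pos = idx
            num_matched += 1
        if num_matched == len(pos_tags):
            # Stop here, otherwise pos_tags[num_matched] raises an IndexError
            matched = True
            break
    return (matched, match_start_pos)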
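
Since the question also asks for every occurrence in a longer text, a sliding-window comparison may be easier to reason about; it also avoids missing a match when a pattern repeats its own prefix. The sketch below is only one possible variant: `find_all_matches` is a hypothetical helper name, and the tagged input is written by hand so the example runs without NLTK data.

```python
def find_all_matches(string_list, pos_tags):
    """Return the start index of every contiguous run of tags equal to pos_tags."""
    tags = [tag for _, tag in string_list]
    wanted = list(pos_tags)
    width = len(wanted)
    if width == 0:
        return []
    # Compare each window of `width` tags against the wanted sequence.
    return [
        start
        for start in range(len(tags) - width + 1)
        if tags[start:start + width] == wanted
    ]


# Hand-written tagged input (an assumption, not real tagger output).
tagged = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("kill", "VB"),
          ("and", "CC"), ("they", "PRP"), ("want", "VBP"), ("to", "TO"), ("run", "VB")]
starts = find_all_matches(tagged, ["PRP", "VBP", "TO"])
print(starts)  # [0, 5]
```

Returning a list of start positions makes it straightforward to slice out each matched span afterwards, for example `tagged[start:start + 3]`.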

More realistically:

Now, more practically, suppose you belong to a civilian protection agency and want to be aware of any tweet by school students that mentions killing. You have already filtered the tweets and want to check whether someone wants to kill someone else.

With a small modification you can achieve something similar (the following idea is loosely based on what is called Frame Semantics):

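def find_match_pattern(string_list, pattern_dict):
    # Dicts keep insertion order (Python 3.7+), so the keys give the tag sequence.
    # Note that a dict cannot contain the same tag twice.
    expected_tags = list(pattern_dict)
    num_matched = 0
    match_start_pos = 0
    matched = False
    # Enumerating gives you an index, so you can record where matching starts
    for idx, (word, tag) in enumerate(string_list):
        expected = expected_tags[num_matched]
        if tag != expected or word not in pattern_dict[expected]:
            # Mismatch: start over, but let this token begin a new match below
            num_matched = 0
            expected = expected_tags[0]
        if tag == expected and word in pattern_dict[expected]:
            if num_matched == 0:
                match_start_pos = idx
            num_matched += 1
        if num_matched == len(expected_tags):
            matched = True
            break
    return (matched, match_start_pos)

# set() takes a single iterable, so set literals are used here.
# Keys must be the exact tags your tagger emits (e.g. "VB", not "V"),
# and the word comparison is case-sensitive.
killing_intent_dict = {"PRP": {"I", "YOU", "He", "She"}, "VB": {"kill"}, "NNP": {"All", "him", "her"}}
matched, start = find_match_pattern(string_list, killing_intent_dict)
if matched:
    # someone wants to kill! Call 911
    print("Match starts at token", start)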
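
A dictionary keyed by tag cannot repeat a tag and relies on key order, so one possible variation is to describe the pattern as an ordered list of slots. The sketch below assumes a hypothetical `(tag, allowed_words)` slot format in which `None` means "any word with that tag", and it compares words case-insensitively. The names `slot_matches`, `find_slot_pattern` and `kill_slots` are made up for this example.

```python
def slot_matches(token, slot):
    """Check one (word, tag) token against one (tag, allowed_words) slot."""
    word, tag = token
    slot_tag, allowed = slot
    if tag != slot_tag:
        return False
    # None is a wildcard: any word carrying the right tag is accepted.
    return allowed is None or word.lower() in allowed


def find_slot_pattern(string_list, slots):
    """Return the start index of the first contiguous match, or -1 if none."""
    width = len(slots)
    if width == 0:
        return -1
    for start in range(len(string_list) - width + 1):
        window = string_list[start:start + width]
        if all(slot_matches(tok, slot) for tok, slot in zip(window, slots)):
            return start
    return -1


# Allowed words are stored lowercase because tokens are lowercased before lookup.
kill_slots = [
    ("PRP", {"i", "you", "he", "she"}),
    ("VBP", None),   # any word tagged VBP, e.g. "want"
    ("TO", None),
    ("VB", {"kill"}),
]
# Hand-written tagged input (an assumption, not real tagger output).
tagged = [("Sometimes", "RB"), ("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("kill", "VB"),
          ("all", "DT"), ("the", "DT"), ("bad", "JJ"), ("guys", "NNS")]
print(find_slot_pattern(tagged, kill_slots))  # 1
```

Wildcard slots let the pattern cover "I want to kill" without listing every verb that might appear between the pronoun and "kill".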

Keep in mind that this is all experimental and requires a lot of hand coding. You can also add NER tags to it, so that specific names are abstracted into entity types.

Adding another possibility, similar to the one I used in my master's research:

Instead of using a linear brute-force mechanism, you could create a graph containing the actions, agents and intents, connecting them all. You then run some sort of graph spreading algorithm while your program reads the input. You can read more in my research, but keep in mind that the topic you are asking about (Natural Language Understanding) is deep and still under development: https://drive.google.com/open?id=12gWLx2saFe5mZI96roUG_p1YfzrqVNbx
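
To give a rough feel for the graph idea, here is a minimal spreading-activation sketch. It is not the method from the linked research: the toy graph, the node names, the decay factor and the number of steps are all assumptions made up for illustration. Words seen in the input activate their nodes, and a share of that activation flows along the edges toward intent nodes.

```python
from collections import defaultdict

# Hypothetical toy graph: edges point from words to concepts to intents.
GRAPH = {
    "i": ["agent"],
    "kill": ["violent_action"],
    "guys": ["target"],
    "agent": ["threat_intent"],
    "violent_action": ["threat_intent"],
    "target": ["threat_intent"],
    "threat_intent": [],
}


def spread_activation(words, graph, decay=0.5, steps=2):
    """Activate the nodes for the given words, then push activation along edges."""
    frontier = defaultdict(float)
    for word in words:
        key = word.lower()
        if key in graph:
            frontier[key] += 1.0
    activation = defaultdict(float, frontier)
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbour in graph.get(node, []):
                # Each hop passes on only a fraction of the energy.
                next_frontier[neighbour] += energy * decay
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


scores = spread_activation("I want to kill all the bad guys".split(), GRAPH)
print(scores["threat_intent"])  # 0.75
```

An intent could then be reported when its score passes a chosen threshold; with this toy graph the word "kill" alone only reaches 0.25, while agent, action and target together reach 0.75.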

Tiago Duque
  • This is sheer beauty, it runs for a few of the entries in my dataframe (where the strings are), however I'm running into an indexing error that I'm trying to diagnose `in find_match if tuple[1] == flag_list[num_matched]: IndexError: list index out of range` – Sebastian Goslin Aug 29 '19 at 17:10
  • Check the items in your tuples; it might be returning empty tuples. Also, check that you're not running past the length of flag_list (you have to return or break after you've found a perfect match). Which solution are you using? – Tiago Duque Aug 29 '19 at 17:26
  • I'm using the first one. I'm checking the tuples now; I think I might have passed the wrong list/dict, that's why – Sebastian Goslin Aug 29 '19 at 19:53
  • The tuples it was passing were nulls, which I fixed. In this case it's breaking once it hits the length of the flag_list, correct? – Sebastian Goslin Aug 29 '19 at 20:02
  • I just corrected an error in the code. Change `if matched == len(pos_tags):` to `if num_matched == len(pos_tags):` – Tiago Duque Aug 29 '19 at 20:58