I am new to NLP. I am trying to search a corpus for a part-of-speech (POS) sequence. The goal is to search for a sequence of POS tags and find all sentences in a given corpus that match that sequence.

Input: The quick brown fox jumped over the lazy dogs.

The tagger processes the sentence, and POS tagging results in: [DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]

Applying the search should then return any sentence that matches this sequence, or a longer sentence that contains it.

How do I search by part of speech? Is there a direct function for this in NLTK or spaCy?

I would appreciate some guidance on the steps needed to solve the problem and the challenges that I might face.

Note that I found a similar question on Stack Overflow ("Search POS"), but I think the problem its author was facing was more specific.

alandalusi
  • Does this answer your question? [Regular expressions in POS tagged NLTK corpus](https://stackoverflow.com/questions/15970033/regular-expressions-in-pos-tagged-nltk-corpus) – colidyre Apr 19 '20 at 23:44

1 Answer

In the following code, s1 is the sentence that supplies the POS pattern and s2 is any other sentence. This is not the direct function the question asks for, but it can be applied to each sentence in a corpus to check whether that sentence matches the pattern.

# Import libraries
# (word_tokenize and pos_tag need NLTK's tokenizer and tagger data, installed via nltk.download)
from nltk import word_tokenize, pos_tag

# Build functions
def tagsToString(t):
    # Join the tags with spaces (plus a trailing space) so every tag ends on a boundary;
    # with plain concatenation, "NN" would wrongly match the start of "NNS"
    return " ".join(t) + " "

def sequenceMatch(s1, s2):
    # Tag both sentences and keep only the POS tags
    tags1 = [tag for _, tag in pos_tag(word_tokenize(s1))]
    tags2 = [tag for _, tag in pos_tag(word_tokenize(s2))]

    # "yes" if the tag sequence of s2 starts with the tag sequence of s1
    if tagsToString(tags2).startswith(tagsToString(tags1)):
        return "yes"
    return "no"

# Example:
s1 = "The quick brown fox"
s2 = "The quick brown fox jumped over the lazy dog"

sequenceMatch(s1, s2)
# 'yes'
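
To search a whole corpus rather than compare two sentences, one option is to compare lists of tags directly and look for the pattern anywhere in each sentence, not only at the start. The following is a minimal sketch, not a built-in NLTK or spaCy feature. It assumes the corpus has already been tagged into lists of (word, tag) pairs, which is the shape pos_tag returns. The function names contains_tag_sequence and search_by_pos and the hand-tagged toy corpus are hypothetical and only serve the illustration.

```python
from typing import List, Sequence, Tuple

# A tagged sentence is a list of (word, tag) pairs, the same shape nltk.pos_tag returns.
TaggedSentence = List[Tuple[str, str]]


def contains_tag_sequence(tags: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if `pattern` occurs as a contiguous run anywhere in `tags`."""
    m = len(pattern)
    if m == 0:
        return True
    # Slide a window of length m over the tags and compare whole tags,
    # so "NN" never matches "NNS" by accident.
    return any(
        list(tags[i:i + m]) == list(pattern)
        for i in range(len(tags) - m + 1)
    )


def search_by_pos(corpus: List[TaggedSentence], pattern: Sequence[str]) -> List[str]:
    """Return the sentences (as plain strings) whose tags contain `pattern`."""
    matches = []
    for sentence in corpus:
        tags = [tag for _, tag in sentence]
        if contains_tag_sequence(tags, pattern):
            matches.append(" ".join(word for word, _ in sentence))
    return matches


# Hand-tagged toy corpus (hypothetical; a real tagger may assign different tags).
corpus = [
    [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
     ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
     ("dogs", "NNS"), (".", ".")],
    [("Dogs", "NNS"), ("bark", "VBP"), (".", ".")],
    [("A", "DT"), ("small", "JJ"), ("grey", "JJ"), ("cat", "NN"),
     ("slept", "VBD"), (".", ".")],
]

pattern = ["DT", "JJ", "JJ", "NN"]
matches = search_by_pos(corpus, pattern)
print(matches)
```

In practice the tagged corpus could be built by running pos_tag(word_tokenize(sentence)) over each sentence, and the pattern could come from tagging an example sentence in the same way. Keep in mind that a tagger can label the same word differently depending on its context, so a pattern derived from a short fragment may not match the tags that fragment receives inside a longer sentence.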