
I am doing sentiment analysis on a set of documents. My goal is to find the closest surrounding adjectives with respect to a target phrase in each sentence. I have an idea of how to extract the surrounding words with respect to the target phrases, but how do I find the relatively close or closest adjective, NNP, VBN, or other POS tag with respect to the target phrase?

Here is a sketch of how I might get the surrounding words with respect to my target phrase.

sentence_List = ["Obviously one of the most important features of any computer is the human interface.",
                 "Good for everyday computing and web browsing.",
                 "My problem was with DELL Customer Service",
                 "I play a lot of casual games online, and the touchpad is very responsive"]

target_phraseList = ["human interface", "everyday computing", "DELL Customer Service", "touchpad"]

Note that my original dataset was given as a dataframe containing the sentences and their respective target phrases. Here I just simulate the data as follows:

import pandas as pd
df = pd.Series(sentence_List, index=target_phraseList)
df = pd.DataFrame(df)
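An alternative way to simulate the same data, keeping the sentence and phrase side by side, might look like this (a minimal sketch; the column names `sentence` and `target` are my own choice, not from the original dataset):

```python
import pandas as pd

# Hypothetical reconstruction of the original dataframe layout:
# one column of sentences, one column of their target phrases.
sentence_List = [
    "Obviously one of the most important features of any computer is the human interface.",
    "Good for everyday computing and web browsing.",
    "My problem was with DELL Customer Service",
    "I play a lot of casual games online, and the touchpad is very responsive",
]
target_phraseList = ["human interface", "everyday computing",
                     "DELL Customer Service", "touchpad"]

df = pd.DataFrame({"sentence": sentence_List, "target": target_phraseList})
print(df.shape)  # → (4, 2)
```

With this layout, each row pairs a sentence with its target phrase, which makes it easy to apply a function row by row later.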

Here I tokenize the sentences as follows:

from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in sentence_List]
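For the "surrounding words" step itself, a minimal pure-Python sketch might look like the following (the `window` size of 3 is an arbitrary choice, and the tokens are produced with a plain `split()` here so the example does not depend on any NLTK data):

```python
# Return up to `window` tokens on either side of a multi-word target phrase.
def surrounding_words(tokens, target_tokens, window=3):
    n = len(target_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target_tokens:
            # Slice before and after the matched phrase span
            return tokens[max(0, i - window):i] + tokens[i + n:i + n + window]
    return []  # phrase not found in this sentence

tokens = "Good for everyday computing and web browsing .".split()
print(surrounding_words(tokens, ["everyday", "computing"]))
# → ['Good', 'for', 'and', 'web', 'browsing']
```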

Then I try to find the surrounding words with respect to my target phrases using the approach linked here. However, I want to find the relatively close or closest adjective, verb, or VBN with respect to my target phrase. How can I make this happen? Any idea how to get this done? Thanks

Hamilton
  • If your target phrase list is small, then I can suggest an approach: write a context-free grammar using the target phrases and POS tags, then parse the sentences using that CFG. https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb , follow the link for code snippets. – Ajay Naredla Nov 18 '18 at 11:08
  • @AjayNaredla can you elaborate your comment with a few lines of code? Thanks – Hamilton Nov 18 '18 at 17:05
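The chunking idea from the comment above can be sketched roughly as follows. This is a minimal example using `nltk.RegexpParser`; the grammar is an illustrative assumption, and the POS tags are hardcoded so the snippet does not require a trained tagger model:

```python
import nltk

# A sentence that has already been POS-tagged (tags hardcoded here).
tagged = [("Good", "JJ"), ("for", "IN"), ("everyday", "JJ"),
          ("computing", "NN"), ("and", "CC"), ("web", "NN"),
          ("browsing", "NN"), (".", ".")]

# Chunk grammar: zero or more adjectives followed by one or more nouns.
grammar = "NP: {<JJ>*<NN>+}"
tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect the chunks the grammar matched.
nps = [subtree.leaves() for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(nps)
# → [[('everyday', 'JJ'), ('computing', 'NN')], [('web', 'NN'), ('browsing', 'NN')]]
```

Matching the chunks against the target phrase list would then tell you which adjectives attach to which phrase.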

1 Answer


Would something like the following work for you? I recognize there are some tweaks that need to be made to make this fully useful (e.g., handling upper/lower case; on a tie, it returns the word ahead of the phrase in the sentence rather than the one behind), but hopefully it is useful enough to get you started:

import nltk
from nltk.tokenize import MWETokenizer

def smart_tokenizer(sentence, target_phrase):
    """
    Tokenize a sentence using a full target phrase.
    """
    tokenizer = MWETokenizer()
    target_tuple = tuple(target_phrase.split())
    tokenizer.add_mwe(target_tuple)
    token_sentence = nltk.pos_tag(tokenizer.tokenize(sentence.split()))

    # The MWETokenizer puts underscores to replace spaces, for some reason
    # So just identify what the phrase has been converted to
    temp_phrase = target_phrase.replace(' ', '_')
    target_index = [i for i, y in enumerate(token_sentence) if y[0] == temp_phrase]
    if len(target_index) == 0:
        return None, None
    else:
        return token_sentence, target_index[0]


def search(text_tag, tokenized_sentence, target_index):
    """
    Search for the part of speech (POS) tag nearest a target phrase of interest.
    On a tie, the word after the phrase wins over the word before it.
    """
    if tokenized_sentence is None:
        return None
    for i in range(1, len(tokenized_sentence)):
        ahead = target_index + i
        behind = target_index - i
        # Each entry is a (word, POS) tuple; check ahead first, then behind,
        # with explicit bounds checks so a negative index can't wrap around
        if ahead < len(tokenized_sentence) and tokenized_sentence[ahead][1] == text_tag:
            return tokenized_sentence[ahead][0]
        if behind >= 0 and tokenized_sentence[behind][1] == text_tag:
            return tokenized_sentence[behind][0]
    return None

x, i = smart_tokenizer(sentence='My problem was with DELL Customer Service',
                       target_phrase='DELL Customer Service')
print(search('NN', x, i))

y, j = smart_tokenizer(sentence="Good for everyday computing and web browsing.",
                       target_phrase="everyday computing")
print(search('NN', y, j))

Edit: I made some changes to handle a target phrase of arbitrary length, as you can see in the smart_tokenizer function. The key there is the nltk.tokenize.MWETokenizer class (for more info, see: Python: Tokenizing with phrases). Hopefully this helps. As an aside, I would challenge the idea that spaCy is necessarily more elegant - at some point, someone has to write the code to get the work done. That will either be the spaCy devs, or you as you roll your own solution. Their API is rather complicated, so I'll leave that exercise to you.
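To see the MWETokenizer behavior in isolation, here is a minimal sketch (the phrase tuple is hardcoded for illustration):

```python
from nltk.tokenize import MWETokenizer

# Multi-word expressions registered with the tokenizer are merged into a
# single token, joined by underscores by default.
tok = MWETokenizer([("DELL", "Customer", "Service")])
toks = tok.tokenize("My problem was with DELL Customer Service".split())
print(toks)
# → ['My', 'problem', 'was', 'with', 'DELL_Customer_Service']
```

This is why smart_tokenizer replaces spaces with underscores before looking up the phrase's index.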

HFBrowning
    Hm, looking closer I can tell you're going to want the target phrases to stay together (`human interface` rather than `human`, `interface`). When I get a moment I'll work that up – HFBrowning Nov 20 '18 at 22:01
  • @Jerry I would recommend that if a solution using `spaCy` is what you are looking for, you should update your question to reflect that. Keeping in mind that SO is not a code-writing service and there are many ways to functionally accomplish the same task. It could be that your question should also be narrowed down to address the particular place you are stuck; to me at first blush it appeared you were confused about how to reach in and manipulate the data structure that `nltk.pos_tag` returned. – HFBrowning Nov 21 '18 at 20:26
  • @Jerry I've updated it now to keep the phrases together – HFBrowning Nov 28 '18 at 16:32
  • You can just modify the lines that said `if len(target_index) == 0: return "Target phrase not found"` to have it be `return None` instead – HFBrowning Nov 29 '18 at 00:09
  • You should try to figure some of these things out on your own. Try finding a python/pandas tutorial. Good luck – HFBrowning Nov 29 '18 at 15:44
  • I don't know why, when I use the `search` function on multiple sentences and target phrases, it returns `None` instead. Why, and how do I fix this? In the `smart_tokenizer` function, I return `[], []` when `len(target_index) == 0` - any idea? – Hamilton Nov 30 '18 at 01:34
  • You need to look up how to apply a function over multiple inputs. In Python you could use a dictionary and for-loop, using pandas it would be `pandas.DataFrame.apply`. I won't answer any more basic Python questions in the comments - if you're truly stuck you should either do some reading and then if still stuck, post a new question. – HFBrowning Nov 30 '18 at 16:20