
I am attempting to write a function that will return a list of NLTK definitions for the 'tokens' tokenized from a text document, subject to the constraint of each word's part of speech.

I first converted the tag given by nltk.pos_tag to the tag used by wordnet.synsets, and then applied .word_tokenize(), .pos_tag(), and .synsets() in turn, as seen in the following code:

import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

#convert the tag to the one used by wordnet.synsets

def convert_tag(tag):    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

#tokenize, tag, and find synsets (give the first match between each 'token' and 'wordnet_tag')

def doc_to_synsets(doc):

    token = nltk.word_tokenize(doc)
    tag = nltk.pos_tag(token)
    wordnet_tag = convert_tag(tag)
    syns = wn.synsets(token, wordnet_tag)

    return syns[0]

#test
doc = 'document is a test'
doc_to_synsets(doc)

which, if programmed correctly, should return something like

[Synset('document.n.01'), Synset('be.v.01'), Synset('test.n.01')]

However, Python throws an error message:

'list' object has no attribute 'lower'

I also noticed that in the error message, it says

lemma = lemma.lower()

Does that mean I also need to 'lemmatize' my tokens, as this previous thread suggests? Or should I apply .lower() to the text document before doing all this?

I am rather new to wordnet and don't really know whether it's .synsets that is causing the problem or the nltk part that is at fault. It would be really appreciated if someone could enlighten me on this.

Thank you.

[Edit] error traceback

AttributeError                            Traceback (most recent call last)
<ipython-input-49-5bb011808dce> in <module>()
     22     return syns
     23 
---> 24 doc_to_synsets('document is a test.')
     25 
     26 

<ipython-input-49-5bb011808dce> in doc_to_synsets(doc)
     18     tag = nltk.pos_tag(token)
     19     wordnet_tag = convert_tag(tag)
---> 20     syns = wn.synsets(token, wordnet_tag)
     21 
     22     return syns

/opt/conda/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py in synsets(self, lemma, pos, lang, check_exceptions)
   1481         of that language will be returned.
   1482         """
-> 1483         lemma = lemma.lower()
   1484 
   1485         if lang == 'eng':

AttributeError: 'list' object has no attribute 'lower'

So after using the code kindly suggested by @dugup and @udiboy1209, I get the following output

[[Synset('document.n.01'),
  Synset('document.n.02'),
  Synset('document.n.03'),
  Synset('text_file.n.01'),
  Synset('document.v.01'),
  Synset('document.v.02')],
 [Synset('be.v.01'),
  Synset('be.v.02'),
  Synset('be.v.03'),
  Synset('exist.v.01'),
  Synset('be.v.05'),
  Synset('equal.v.01'),
  Synset('constitute.v.01'),
  Synset('be.v.08'),
  Synset('embody.v.02'),
  Synset('be.v.10'),
  Synset('be.v.11'),
  Synset('be.v.12'),
  Synset('cost.v.01')],
 [Synset('angstrom.n.01'),
  Synset('vitamin_a.n.01'),
  Synset('deoxyadenosine_monophosphate.n.01'),
  Synset('adenine.n.01'),
  Synset('ampere.n.02'),
  Synset('a.n.06'),
  Synset('a.n.07')],
 [Synset('trial.n.02'),
  Synset('test.n.02'),
  Synset('examination.n.02'),
  Synset('test.n.04'),
  Synset('test.n.05'),
  Synset('test.n.06'),
  Synset('test.v.01'),
  Synset('screen.v.01'),
  Synset('quiz.v.01'),
  Synset('test.v.04'),
  Synset('test.v.05'),
  Synset('test.v.06'),
  Synset('test.v.07')],
 []]

The problem now comes down to extracting the first match (or first element) of each list from the list 'syns' and making them into a new list. For the trial document 'document is a test', it should return:

[Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')]

which is a list of the first match for each token in the text document.
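In other words, something like this list comprehension, but one that also survives the empty sublist at the end (a sketch with placeholder strings standing in for the Synset objects above):

```python
# Placeholder for the nested 'syns' output above (strings stand in for Synsets)
syns = [['document.n.01', 'document.n.02'],
        ['be.v.01', 'be.v.02'],
        ['angstrom.n.01', 'vitamin_a.n.01'],
        ['trial.n.02', 'test.n.02'],
        []]  # the period token matched nothing

# Keep the first element of each non-empty sublist; the 'if s' guard
# skips empty sublists, which would otherwise raise IndexError
first_matches = [s[0] for s in syns if s]
print(first_matches)
# ['document.n.01', 'be.v.01', 'angstrom.n.01', 'trial.n.02']
```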

Chris T.
  • Can you post the entire error traceback? – Tony Aug 29 '17 at 19:41
  • 1
    I included that in the 'edit' part below the original post. – Chris T. Aug 29 '17 at 19:43
  • Why are you using .lower()? – Rlz Aug 29 '17 at 19:49
  • I did not use .lower() in my code but saw it mentioned in a couple of earlier question threads, hence I raised it here. – Chris T. Aug 29 '17 at 19:52
  • @ChrisT what is the result of the code if before the line `lemma = lemma.lower()` you put `print(lemma) ` – Rlz Aug 29 '17 at 19:57
  • I'm sorry, I have to go run some errands. I misunderstood your problem so I'll tackle this again in an hour if no one else has got to it. @RulerOfTheWorld, Chris is using NLTK slightly wrong, triggering a warning inside that package. He's not calling lower. – Tony Aug 29 '17 at 19:58
  • It's not in my code and I wonder why lemma.lower() (or anything that has to do with 'lemma') appeared in the error message. – Chris T. Aug 29 '17 at 19:58
  • @Tony, thanks again for your assistance! I will also revise my original thread as 'take the 1st one' looks misleading. – Chris T. Aug 29 '17 at 20:00
  • @Tony ok thanks for clearing that up :) – Rlz Aug 29 '17 at 20:01

2 Answers


The problem is that wn.synsets expects a single token as its first argument but word_tokenize returns a list containing all of the tokens in the document. So your token and tag variables are actually lists.

You need to loop through all of the token-tag pairs in your document and generate a synset for each individually using something like:

tokens = nltk.word_tokenize(doc)
tags = nltk.pos_tag(tokens)
doc_synsets = []
for token, tag in tags:  # pos_tag returns (word, tag) pairs
    wordnet_tag = convert_tag(tag)
    syns = wn.synsets(token, wordnet_tag)
    # only add the first matching synset to the results
    if syns:  # some tokens (e.g. punctuation) have no synsets
        doc_synsets.append(syns[0])
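One thing to watch: nltk.pos_tag returns (word, tag) pairs, so the tag string has to be pulled out of the pair before it goes to convert_tag; handing the whole pair over means convert_tag indexes into the word instead of the tag and always returns None. A self-contained sketch of that unpacking, with hard-coded stand-ins for the NLTK calls:

```python
def convert_tag(tag):
    # Map the first letter of a Penn Treebank tag to a WordNet POS tag
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    return tag_dict.get(tag[0])

# Stand-in for nltk.pos_tag(nltk.word_tokenize('document is a test.'))
tags = [('document', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN'), ('.', '.')]

# Unpack each (word, tag) pair; the tag string, not the pair, goes to convert_tag
wordnet_tags = [convert_tag(tag) for word, tag in tags]
print(wordnet_tags)  # ['n', 'v', None, 'n', None]
```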
dugup
  • Thanks for your reply. I have tried your recommendation, it generated the same output as what @udiboy1209's code would have it. So how can I extract the first match of each token from the document and generate an output like the this (this is what I got by using your code): [Synset('document.n.01'), Synset('be.v.01'), Synset('angstrom.n.01'), Synset('trial.n.02')] – Chris T. Aug 29 '17 at 20:32
  • What exactly do you mean by "first match of each token"? `syns` is a list containing all the synonyms from wordnet for the given token - you can append the first element to the results instead of the whole list with `all_synsets.append(syns[0])` – dugup Aug 29 '17 at 21:41
  • syns returns a list of 'list of nltk reference' for each token that are potential matches between that token and wordnet_tag, if you have 5 tokens in a text document, then it will return 5 lists of reference for that 5 tokens. Using 'all_synsets.append(syns[0])' only returns the list of reference for the first token. I want to get the first match between each token and the wordnet_tag for all tokens in a text document. So the output should look like [Synset('document.n.01'), Synset('be.v.01'), Synset('test.n.01')] – Chris T. Aug 29 '17 at 21:48
  • I used first_match = [x[0] for x in doc_to_synsets('document is a test.')] to get a list of first matches for each token in the document, but Python returns an error message 'list index out of range.' I am confused. – Chris T. Aug 29 '17 at 21:53
  • I've updated my answer to append the first synset for each token to the results. At the end of the loop `doc_synsets` will be a list containing the first match for each token in the document and should be what you want – dugup Aug 29 '17 at 21:55
  • The output is the same as your previous recommendation and now it is un-subscriptable. – Chris T. Aug 29 '17 at 22:04
  • I guess the question now comes down to (1) getting the first element from each list in a list of lists (i.e., syns), and (2) making those 'first elements' into a new list. For example, 'document is a test' has 4 tokens, so the desired output should be of the form [syns[0][0], syns[1][0], syns[2][0], syns[3][0]]. I tried to loop through the output 'syns' but kept getting 'list index out of range' error. – Chris T. Aug 29 '17 at 22:15
  • I added the syns output at the bottom of my original question, it is easier to visualize the desired output format that way. – Chris T. Aug 29 '17 at 22:17

lower() is a method of the str type, which basically returns a lower-case version of the string.

It looks like nltk.word_tokenize() returns a list of words, not a single word. But to synsets() you need to pass a single str, not a list of str.
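That is exactly where the traceback comes from: synsets() calls .lower() on its first argument, which works on a str but not on a list. A quick reproduction, no NLTK needed:

```python
# str has lower()
print('Document'.lower())  # 'document'

# list does not, which is the same error as in the traceback
tokens = ['document', 'is', 'a', 'test']
try:
    tokens.lower()
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'lower'
```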

You may want to try running synsets in a loop like so:

for token in nltk.word_tokenize(doc):
    syn = wn.synsets(token)

EDIT: better, use a list comprehension to get a list of syns:

syns = [wn.synsets(token) for token in nltk.word_tokenize(doc)]
udiboy1209