Extracting lemmas per sentence with treetaggerwrapper returns one flat list of words instead of one list per sentence

Question

Here is my function, which is supposed to lemmatize a list of sentences. However, the output is a single flat list of all the words rather than one lemmatized result per sentence.

Code for lemmatize function

tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr') 
def lemmatize(corpus):
    lemmatize_list_of_sentences = []
    lemmatize_list_of_sentences2 = []
    for sentence in corpus:
        tags = tagger.tag_text(sentence)
        tags2 = treetaggerwrapper.make_tags(tags, allow_extra = True)
        lemmatize_list_of_sentences.append(tags2)
        print(lemmatize_list_of_sentences)
        for subl in lemmatize_list_of_sentences: # loop in list of sublists 
            for word in subl:
                if word.__class__.__name__ == "Tag":
                    lemme=word[2] #  I want also to check if lemme[2] is empty and add this 
                    lemmeOption2=lemme.split("|")
                    lemme=lemmeOption1[0]
                    lemmatize_list_of_sentences2.append(lemme)


    return lemmatize_list_of_sentences2 # should return a list of lists where each list contains the lemme retrieve



lemmatize_train= lemmatize(sentences_train_remove_stop_words)
lemmatize_test= lemmatize(sentences_test_remove_stop_words)
print(lemmatize_train)

Furthermore, I would like to add a check to the lemmatize function: if the lemma (index 2, or -1) is empty, use the word at the first index instead.

I came up with this, but how can I combine it with my lemmatize function?

for word in subl:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                lemmatize_list_of_sentences2.append(parts[0])
            else:
                lemmatize_list_of_sentences2.append(parts[2])
        except IndexError:
            print(parts)

List of sentences in the input file:

La période de rotation de la Lune est la même que sa période orbitale et elle présente donc toujours le même hémisphère. 
Cette rotation synchrone résulte des frottements qu’ont entraînés les marées causées par la Terre.

After tagging the text and printing the list of tagged sentences, I get this:

First sentence:

[[Tag(word='la', pos='DET:ART', lemma='le'), Tag(word='période', pos='NOM', lemma='période'), Tag(word='rotation', pos='NOM', lemma='rotation'), Tag(word='lune', pos='NOM', lemma='lune'), Tag(word='période', pos='NOM', lemma='période'), Tag(word='orbitale', pos='ADJ', lemma='orbital'), Tag(word='présente', pos='VER:pres', lemma='présenter'), Tag(word='donc', pos='ADV', lemma='donc'), Tag(word='toujours', pos='ADV', lemma='toujours')]]

All sentences:

[[Tag(word='la', pos='DET:ART', lemma='le'), Tag(word='période', pos='NOM', lemma='période'), Tag(word='rotation', pos='NOM', lemma='rotation'), Tag(word='lune', pos='NOM', lemma='lune'), Tag(word='période', pos='NOM', lemma='période'), Tag(word='orbitale', pos='ADJ', lemma='orbital'), Tag(word='présente', pos='VER:pres', lemma='présenter'), Tag(word='donc', pos='ADV', lemma='donc'), Tag(word='toujours', pos='ADV', lemma='toujours')], [Tag(word='cette', pos='PRO:DEM', lemma='ce'), Tag(word='rotation', pos='NOM', lemma='rotation'), Tag(word='synchrone', pos='ADJ', lemma='synchrone'), Tag(word='résulte', pos='VER:pres', lemma='résulter'), Tag(word='frottements', pos='NOM', lemma='frottement'), Tag(word='entraînés', pos='VER:pper', lemma='entraîner'), Tag(word='les', pos='DET:ART', lemma='le'), Tag(word='marées', pos='NOM', lemma='marée'), Tag(word='causées', pos='VER:pper', lemma='causer')]]

After retrieving the lemmas, I get a single flat list of words, which is not what I expected. I expected one result per sentence.

Output :

['le', 'période', 'rotation', 'lune', 'période', 'orbital', 'présenter', 'donc', 'toujours', 'ce', 'rotation', 'synchrone', 'résulter', 'frottement', 'entraîner', 'le', 'marée', 'causer']

Expected: each sentence as a single string, with its lemmas separated by spaces.


['le période rotation lune période orbital présenter donc toujours','ce rotation synchrone résulter frottement entraîner le marée causer']

It seems that your list is filled with "Tag" objects. Instead of checking for empty strings in the sublist, you could check whether the item in the list is of type "Tag" with the isinstance(variable, Class) function. After that, you could check whether the attribute is empty with (variable.word == ""). Also, take into account that lemmatizers work on words, not phrases; that's why it's all split. Try concatenating them with " ".join(listname). — Tiago Duque, Jun 06 '19 at 11:43
@Tiago Duque thank you, but how can I write the line to check whether the variable inside the class is empty: if word.__class__.__name__ == "Tag": variable = lemme=word[2] if variable == " ": lemme=word[0] else: lemme=word[2] lemmeOption1=lemme.split("|") lemme=lemmeOption1[0] #print(lemme) lemmatize_list_of_sentences2.append( ''.join(lemme )) — kely789456123, Jun 06 '19 at 16:37
Look at the code here: https://pastebin.com/TDXHJQ4B and see if it solves for your case — Tiago Duque, Jun 06 '19 at 16:54
I get this: if isinstance(word, Tag): NameError: name 'Tag' is not defined — kely789456123, Jun 07 '19 at 07:29
I get this: if isinstance(word, Tag): NameError: name 'Tag' is not defined, but it works when I use this instead: if word.__class__.__name__ == "Tag": . My problem is still the flat list of words: how can I obtain one list per sentence rather than all the sentences together? — kely789456123, Jun 07 '19 at 09:03
Ok, now I understood your problem. I'll post an answer with that soon. — Tiago Duque, Jun 07 '19 at 11:03
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/194590/discussion-between-tiago-duque-and-kely789456123). — Tiago Duque, Jun 07 '19 at 11:10

Tiago Duque · Answer 1 · 2019-06-10T10:48:24.217

So you want one list of lemmas per sentence.

You are returning a flat list; you have to make sure you return a list of lists.

import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')

def lemmatize(corpus):
    lemmatize_list_of_sentences = []
    lemmatize_list_of_sentences2 = []
    # First pass: tag every sentence and keep one list of tags per sentence.
    for sentence in corpus:
        tags = tagger.tag_text(sentence)
        tags2 = treetaggerwrapper.make_tags(tags, allow_extra=True)
        lemmatize_list_of_sentences.append(tags2)
    # Second pass, outside the first loop, so each sentence is processed exactly once.
    for subl in lemmatize_list_of_sentences:  # loop over the list of sublists
        # Here you create a list to work as an "inner" sentence list.
        sentence_lemmas = []
        for word in subl:
            if word.__class__.__name__ == "Tag":
                lemme = word[2]  # index 2 of a Tag is the lemma
                lemmeOption2 = lemme.split("|")
                lemme = lemmeOption2[0]  # the question used lemmeOption1 here, which was a typo
                sentence_lemmas.append(lemme)  # Here you append the lemma extracted
        # Here you append the "inner" list to the outer list.
        lemmatize_list_of_sentences2.append(sentence_lemmas)
    return lemmatize_list_of_sentences2  # a list of lists, one list of lemmas per sentence


lemmatize_train = lemmatize(sentences_train_remove_stop_words)
lemmatize_test = lemmatize(sentences_test_remove_stop_words)
print(lemmatize_train)

Checking if the lemma is empty

Also, from the docs (TreeTagger wrapper docs), "Tag" is a "named tuple".

You can understand more about "named tuples" in this post.

But, basically, you can refer to "Tag" attributes in the same way as you would with objects, using the . (dot) notation.

So, to check if the lemma is empty, you can do:

if word.lemma != "":
    lemme = word.lemma
else:
    lemme = word.word  # fall back to the original word when there is no lemma
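
The check above can be wrapped in a small helper. This is only a sketch: `Tag` is a stand-in named tuple defined locally, assuming the same three fields the wrapper uses (word, pos, lemma), so the snippet runs without TreeTagger installed. `pick_lemma` is a hypothetical name, and the "suivre|être" lemma is just an illustrative ambiguous value.

```python
from collections import namedtuple

# Stand-in for the wrapper's Tag (assumption: fields are word, pos, lemma in that order).
Tag = namedtuple("Tag", ["word", "pos", "lemma"])


def pick_lemma(tag):
    """Return the first lemma alternative, or the original word if the lemma is empty."""
    lemma = tag.lemma.strip()
    if not lemma:
        return tag.word  # same value as tag[0]
    # Ambiguous lemmas are written "a|b"; keep only the first alternative.
    return lemma.split("|")[0]


print(pick_lemma(Tag("marées", "NOM", "marée")))
print(pick_lemma(Tag("xyz", "NOM", "")))
print(pick_lemma(Tag("suis", "VER:pres", "suivre|être")))
```

Inside the loop you would then append `pick_lemma(word)` instead of repeating the split and the empty check inline.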

Joining lists

Also, if you want to re-join the lemma list in the end, do:

joined_sentences = []
for lemma_list in lemmatize_train:
   joined_sentences.append(" ".join(lemma_list))

print(joined_sentences)
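
The same join can be written as a list comprehension. A small sketch with hard-coded lemma lists (`lemma_lists` is a made-up stand-in for what the function returns), so it runs on its own:

```python
# Hypothetical stand-in for the list of lists returned by lemmatize().
lemma_lists = [
    ["le", "période", "rotation"],
    ["ce", "rotation", "synchrone"],
]

# One string per inner list; the outer list keeps one entry per sentence.
joined_sentences = [" ".join(lemmas) for lemmas in lemma_lists]
print(joined_sentences)
```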

Function returning joined strings:

def lemmatize(corpus):
    lemmatize_list_of_sentences = []
    lemmatize_list_of_sentences2 = []
    # Tag every sentence and keep one list of tags per sentence.
    for sentence in corpus:
        tags = tagger.tag_text(sentence)
        tags2 = treetaggerwrapper.make_tags(tags, allow_extra=True)
        lemmatize_list_of_sentences.append(tags2)
    for subl in lemmatize_list_of_sentences:  # loop over the list of sublists
        # Here you create a list to work as an "inner" sentence list.
        sentence_lemmas = []
        for word in subl:
            if word.__class__.__name__ == "Tag":
                lemme = word[2]  # index 2 of a Tag is the lemma
                lemmeOption2 = lemme.split("|")
                lemme = lemmeOption2[0]  # keep the first alternative of an ambiguous lemma
                sentence_lemmas.append(lemme)  # Here you append the lemma extracted
        lemmatize_list_of_sentences2.append(sentence_lemmas)
    # Turn each inner list of lemmas into a single space-separated string.
    joined_sentences = []
    for lemma_list in lemmatize_list_of_sentences2:
        joined_sentences.append(" ".join(lemma_list))
    return joined_sentences
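
To see the shape of the result without running TreeTagger, here is a sketch that works on already-tagged data. `Tag` is again a local stand-in named tuple and `lemmas_per_sentence` is a hypothetical name. With the real wrapper, the class to pass to isinstance should be `treetaggerwrapper.Tag`, which would also avoid the NameError from the comments, but treat that as an assumption to check against the wrapper docs.

```python
from collections import namedtuple

# Stand-in for the wrapper's Tag (assumption: fields are word, pos, lemma).
Tag = namedtuple("Tag", ["word", "pos", "lemma"])


def lemmas_per_sentence(tagged_sentences):
    """Build one space-separated string of lemmas per tagged sentence."""
    result = []
    for tags in tagged_sentences:
        lemmas = [
            # Use the word itself when the lemma is empty; keep the first "a|b" alternative.
            (tag.lemma or tag.word).split("|")[0]
            for tag in tags
            if isinstance(tag, Tag)
        ]
        result.append(" ".join(lemmas))
    return result


tagged = [
    [Tag("la", "DET:ART", "le"), Tag("période", "NOM", "période")],
    [Tag("cette", "PRO:DEM", "ce"), Tag("rotation", "NOM", "rotation")],
]
print(lemmas_per_sentence(tagged))
```

The key point is that the string is built and appended once per sentence, so the output has exactly as many entries as the input.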

Hope it is clear now.

Thank you, but I still get a list of the sentences, not a list of lists containing only the lemmas. — kely789456123, Jun 07 '19 at 14:31
So you want: ["this is a lemmatized string", "this is another one"]? The last part of the answer does it. You can add that block inside the function before returning (changing the variable), no problem. — Tiago Duque, Jun 07 '19 at 16:20
The function is returning a list of strings from all the sentences, but the problem is that at the beginning I am looking at one sentence after another, so the logical answer should be one string of all the lemmas for each sentence passing through my function. For example, if I have list = [" i am sleeping", " he is so busy"], the final list should look like this: list = ["be sleep", "be so busy"], not list = ["be", "sleep", "be", "so", "busy"]. The last one is what your code prints. — kely789456123, Jun 08 '19 at 22:38
This is what the "Joining lists" part of the answer does. I modified the function so it returns a list of lists of lemmas. Then I gave you a way to turn each inner list into a string in a list. Anyway, I'll modify the answer so you can have a "final" function that returns a list of lemmatized strings. — Tiago Duque, Jun 10 '19 at 10:45
