
I am preprocessing text data. However, I am facing an issue with lemmatization. Below is the sample text:

'An 18-year-old boy was referred to prosecutors Thursday for allegedly stealing about ¥15 million ($134,300) worth of cryptocurrency last year by hacking a digital currency storage website, police said.', 'The case is the first in Japan in which criminal charges have been pursued against a hacker over cryptocurrency losses, the police said.', '\n', 'The boy, from the city of Utsunomiya, Tochigi Prefecture, whose name is being withheld because he is a minor, allegedly stole the money after hacking Monappy, a website where users can keep the virtual currency monacoin, between Aug. 14 and Sept. 1 last year.', 'He used software called Tor that makes it difficult to identify who is accessing the system, but the police identified him by analyzing communication records left on the website’s server.', 'The police said the boy has admitted to the allegations, quoting him as saying, “I felt like I’d found a trick no one knows and did it as if I were playing a video game.”', 'He took advantage of a weakness in a feature of the website that enables a user to transfer the currency to another user, knowing that the system would malfunction if transfers were repeated over a short period of time.', 'He repeatedly submitted currency transfer requests to himself, overwhelming the system and allowing him to register more money in his account.', 'About 7,700 users were affected and the operator will compensate them.', 'The boy later put the stolen monacoins in an account set up by a different cryptocurrency operator, received payouts in a different cryptocurrency and bought items such as a smartphone, the police said.', 'According to the operator of Monappy, the stolen monacoins were kept using a system with an always-on internet connection, and those kept offline were not stolen.'

My code is:

import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

df = pd.read_csv('All Articles.csv')
df['Articles'] = df['Articles'].str.lower()

stemming = PorterStemmer()
stops = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

def identify_tokens(row):
    Articles = row['Articles']
    tokens = nltk.word_tokenize(Articles)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words


df['words'] = df.apply(identify_tokens, axis=1)


def stem_list(row):
    my_list = row['words']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return (stemmed_list)


df['stemmed_words'] = df.apply(stem_list, axis=1)


def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return (lemma_list)


df['lemma_words'] = df.apply(lemma_list, axis=1)


def remove_stops(row):
    my_list = row['lemma_words']
    meaningful_words = [w for w in my_list if not w in stops]
    return (meaningful_words)


df['stem_meaningful'] = df.apply(remove_stops, axis=1)


def rejoin_words(row):
    my_list = row['stem_meaningful']
    joined_words = (" ".join(my_list))
    return joined_words


df['processed'] = df.apply(rejoin_words, axis=1)

As is clear from the code, I am using pandas. However, here I have given only the sample text.

My problem area is:

def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return (lemma_list)

df['lemma_words'] = df.apply(lemma_list, axis=1)

Though the code runs without any error, the lemmatization function is not working as expected.

Thanks in Advance.

Piyush Ghasiya

1 Answer


In your code above you are trying to lemmatize words that have already been stemmed. When the lemmatizer runs into a word that it doesn't recognize, it simply returns that word unchanged. For instance, stemming `offline` produces `offlin`, and when you run that through the lemmatizer it just gives back the same word, `offlin`.

Your code should be modified to lemmatize the original (unstemmed) words, like this...

def lemma_list(row):
    my_list = row['words']  # changed: lemmatize the original tokens, not the stemmed ones
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return lemma_list

df['lemma_words'] = df.apply(lemma_list, axis=1)
print('Words: ',  df.loc[0, 'words'])
print('Stems: ',  df.loc[0, 'stemmed_words'])
print('Lemmas: ', df.loc[0, 'lemma_words'])

This produces...

Words:  ['and', 'those', 'kept', 'offline', 'were', 'not', 'stolen']
Stems:  ['and', 'those', 'kept', 'offlin',  'were', 'not', 'stolen']
Lemmas: ['and', 'those', 'keep', 'offline', 'be',   'not', 'steal']

This is correct.

bivouac0
  • Thanks, this is working. But one question: you said in your earlier post that a part of speech is also required by the WordNetLemmatizer, but here you didn't use it. I want to know whether this code lemmatizes only verbs or every part of speech (noun, adjective, adverb, etc.). – Piyush Ghasiya Oct 30 '19 at 04:37
  • The code above uses `pos='v'`, so it's going to apply the verb rules. I believe the way WordNet works is that it applies the noun rules by default, and if you want it to operate on a verb, etc., you need to tell it the POS type. That means you need to run a tagger on the original tokenized text and then pass `n`, `v` or `a` to WordNet if you want it to work correctly for all words. – bivouac0 Oct 30 '19 at 04:43
  • Regarding "you need to run a tagger on the original tokenized text and then pass `n`, `v` or `a` to WordNet if you want it to work correctly for all words" — can you tell me how to do that in code? – Piyush Ghasiya Oct 30 '19 at 05:24
  • Take a look at [NLTK](https://www.nltk.org/book/ch05.html). If the tag starts with an `N`, pass `n` to the lemmatizer, etc. You might consider using [Spacy](https://spacy.io/) instead. It's somewhat easier and more accurate. – bivouac0 Oct 30 '19 at 13:20