
I have a PySpark dataframe that contains a list of words in each row.

For example:

+--------------------+-----+
|             removed|stars|
+--------------------+-----+
|[giant, best, buy...|  3.0|
|[wow, surprised, ...|  4.0|
|[one, day, satisf...|  3.0|

I want to apply a lemmatizer to each row with:

from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()
df_list = df_removed.withColumn("removed",lemmatizer.lemmatize(df_removed["removed"]))

I'm getting an error:

TypeError: unhashable type: 'Column'

I don't want to use RDDs and the map function; I just want to use the lemmatizer on the DataFrame. How should I do this, and how can I fix this error?

milva
  • look for `word_tokenize` ex: `df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)` – Jasar Orion Apr 15 '20 at 16:03
  • Create a function `def fun(x): return [lemmatizer.lemmatize(i) for i in x]` and replace `fun` with `translate` in the linked answer – anky Apr 15 '20 at 16:32
  • Hi, why word_tokenize? I already have my words split, I just need to lemmatize them – milva Apr 15 '20 at 16:43
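
Building on anky's comment, a minimal sketch of that UDF approach might look like this (it assumes a running SparkSession, that `removed` is an array<string> column, and that the WordNet data is available on the executors; the name `lemmatize_words` is illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Wrap the per-row Python logic in a UDF so Spark can apply it to a Column
@udf(returnType=ArrayType(StringType()))
def lemmatize_words(words):
    return [lemmatizer.lemmatize(w) for w in words]

df_list = df_removed.withColumn("removed", lemmatize_words(df_removed["removed"]))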

1 Answer


The FreqDist function takes an iterable of hashable objects (intended to be strings, but it works with any hashable type). The error you're getting is because you pass in an iterable of lists. As you suggested, this comes from the change you made:

df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

If I understand the Pandas apply documentation correctly, that line applies the nltk.word_tokenize function to a Series, and word_tokenize returns a list of words.
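
As a minimal illustration (the tokens here are made up), feeding FreqDist an iterable of lists reproduces the problem, because lists are not hashable, while a flat list of strings works:

from nltk import FreqDist

tokenized = [['giant', 'best'], ['wow', 'surprised']]
FreqDist(tokenized)     # TypeError: unhashable type: 'list'
FreqDist(tokenized[0])  # fine: roughly FreqDist({'giant': 1, 'best': 1})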

As a solution, simply add the lists together before trying to apply FreqDist, like so:

allWords = []
for wordList in words:
    allWords += wordList
FreqDist(allWords)
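
If you prefer, the same flattening can be written with itertools (an equivalent alternative, not part of the original code):

from itertools import chain
from nltk import FreqDist

allWords = list(chain.from_iterable(words))
FreqDist(allWords)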

Here is a more complete revision that does what you want. If all you need is to identify the second set of 100 most common words, note that mclist will contain it after the second pass.

import pandas as pd
import nltk
from nltk import FreqDist
from nltk.corpus import brown
from nltk.tokenize import RegexpTokenizer

# note: newer pandas versions use on_bad_lines='skip' instead of error_bad_lines=False
df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

lists =  df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

#Out: ['the',
# ',',
# '.',
# 'of',
# 'and',
#...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist contains second-most common set of 100 words
words = [w for w in words if w in mclist]
# this will keep ALL occurrences of the words in mclist
Jasar Orion