The FreqDist function takes an iterable of hashable objects (intended to be strings, though it will probably work with any hashable type). The error you're getting is because you pass in an iterable of lists. As you suggested, this is due to the change you made:
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
If I understand the Pandas apply function documentation correctly, that line applies the nltk.word_tokenize function to each element of the series, and word_tokenize returns a list of words.
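To see why that combination fails, here is a minimal sketch (the token lists are made up for illustration):

from nltk import FreqDist

# each element is a list of tokens, like the rows of df['tokenized_sents']
tokenized = [['hello', 'world'], ['hello', 'again']]

# FreqDist counts elements by using them as dictionary keys;
# lists are unhashable, so this raises
# TypeError: unhashable type: 'list'
FreqDist(tokenized)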
As a solution, simply concatenate the lists into one flat list before applying FreqDist, like so:
allWords = []
for wordList in words:
    allWords += wordList  # extend the flat list with each row's tokens
FreqDist(allWords)
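As a side note, itertools.chain.from_iterable does the same flattening in one step; a minimal sketch, with a made-up stand-in for the tokenized series:

from itertools import chain
from nltk import FreqDist

words = [['hello', 'world'], ['hello', 'again']]  # stand-in for df['tokenized_sents']
allWords = list(chain.from_iterable(words))
fdist = FreqDist(allWords)
print(fdist.most_common(2))  # [('hello', 2), ('world', 1)]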
Below is a more complete revision that does what you'd like. If all you need is to identify the second set of 100 words, note that mclist will hold that set the second time through.
import pandas as pd
import nltk
from nltk import FreqDist
from nltk.corpus import brown
from nltk.tokenize import RegexpTokenizer

# first run may require: nltk.download('punkt') and nltk.download('brown')
df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)  # on newer pandas, use on_bad_lines='skip'

tokenizer = RegexpTokenizer(r'\w+')  # defined but unused below
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

# flatten the Series of token lists into one flat list of words
lists = df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList

# remove the 100 most common words, based on the Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]
# mclist looks like:
# ['the',
#  ',',
#  '.',
#  'of',
#  'and',
# ...]

# keep only the most common of the remaining words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist now contains the second-most-common set of 100 words
words = [w for w in words if w in mclist]
# this keeps ALL occurrences of the words in mclist
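If what you actually want at the end is the words with their counts rather than the filtered token list, the final fdist already holds them; a small sketch, assuming the variables from the script above:

# the 100 surviving words, each with its frequency
for word, count in fdist.most_common(100):
    print(word, count)

# to keep each word only once while preserving first-seen order,
# dict.fromkeys deduplicates (dicts preserve insertion order in Python 3.7+)
unique_words = list(dict.fromkeys(words))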