I have an excel file with many rows and columns. I want to do the following. First, I want to filter the rows based on a text match. Second, I want to choose a particular column and generate word frequency for ALL THE WORDS in that column. Third, I want to graph the word and frequency.
I have figured out the first part. My question is how to apply Counter() on a dataframe. If I just use Counter(df), it returns an error. So, I used the following code to convert each row into a list and then applied Counter. When I do this, I get the word frequency for each row separately (if I use counter within the for loop, else I get the word frequency for just one row). However, I want a word count for all the rows put together. Appreciate any inputs. Thanks! The following is an example data.
product review
a Great Product
a Delivery was fast
a Product received in good condition
a Fast delivery but useless product
b Dont recommend
b I love it
b Please dont buy
b Second purchase
My desired output is like this: for product a - (product,3),(delivery,2)(fast,2) etc..my current output is like (great,1), (product,1) for the first row.
This is the code I used.
strdata = column.values.tolist()
tokens = [tokenizer.tokenize(str(i)) for i in strdata]
cleaned_list = []
for m in tokens:
stopped = [i for i in m if str(i).lower() not in stop_words]
stemmed = [stemmer.stem(i) for i in stopped]
cleaned_list.append(stopped) #append stemmed words to list
count = Counter(stemmed)
print(count.most_common(10))