Say I have the following data in a csv file, example.csv
:
Word Score
Dog 1
Bird 2
Cat 3
Dog 2
Dog 3
Dog 1
Bird 3
Cat 1
Bird 1
Cat 3
I want to count the frequency of each word for each score. The expected output is the following:
1 2 3
Dog 2 1 1
Bird 0 1 1
Cat 1 0 2
My code to do this is as follows:
import pandas as pd
x1 = pd.read_csv(r'path\to\example.csv')
def getUniqueWords(allWords) :
uniqueWords = []
for i in allWords:
if not i in uniqueWords:
uniqueWords.append(i)
return uniqueWords
unique_words = getUniqueWords(x1['Word'])
unique_scores = getUniqueWords(x1['Score'])
scores_matrix = [[0 for x in range(len(unique_words))] for x in range(len(unique_scores)+1)]
# The '+1' is because Python indexing starts from 0; so if a score of 0 is present in the data, the 0 index will be used for that.
for i in range(len(unique_words)):
temp = x1[x1['Word']==unique_words[i]]
for j, word in temp.iterrows():
scores_matrix[i][j] += 1 # Supposed to store the count for word i with score j
But this gives the following error:
IndexError Traceback (most recent call last)
<ipython-input-123-141ab9cd7847> in <module>()
19 temp = x1[x1['Word']==unique_words[i]]
20 for j, word in temp.iterrows():
---> 21 scores_matrix[i][j] += 1
IndexError: list index out of range
Also, even if I could fix this error, the scores_matrix
would not show the headers (Dog
, Bird
, Cat
as row indices, and 1
, 2
, 3
as column indices). I would want to be able to access the count for each word with each score - something to this effect:
scores_matrix['Dog'][1]
>>> 2
scores_matrix['Cat'][2]
>>> 0
So, how would I solve/fix both these issues?