I'm trying to implement a TF-IDF vectorizer without sklearn. For every word in a corpus (a list of strings, one string per document), I want to count the number of documents in which that word appears. Example:
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
Desired output: {'this': 4, 'is': 4}, and so on for every word.
My code:
def docs(corpus):
    doc_count = dict()
    for line in corpus:
        for word in line.split():
            if word in line:
                doc_count[word] +=1
            else:
                doc_count[word] = 1
    print(counts)

docs(corpus)
Error I'm facing:
KeyError                                  Traceback (most recent call last)
<ipython-input-70-6bf2b69708bc> in <module>
      9     print(counts)
     10
---> 11 docs(corpus)

<ipython-input-70-6bf2b69708bc> in docs(corpus)
      4         for word in line.split():
      5             if word in line.split():
----> 6                 doc_count[word] +=1
      7             else:
      8                 doc_count[word] = 1

KeyError: 'this'
Please let me know where I'm going wrong and whether I'm iterating properly. Thank you!
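For reference, here is a minimal sketch of what I think the counting should look like, assuming words are separated by whitespace and matching is case-sensitive. The KeyError seems to come from the membership test: every `word` comes from `line.split()`, so testing it against the line is always true, and `doc_count[word] += 1` then runs for a key that is not in the dictionary yet. The test presumably needs to be against `doc_count` instead. Two further points: `print(counts)` refers to a name that is never defined, and counting every occurrence would give 'document' a count of 4, whereas it appears in only 3 documents. The function name `document_frequency` is my own placeholder, and `defaultdict` and `set` are just one way to do this.

```python
from collections import defaultdict


def document_frequency(corpus):
    """Count, for each word, the number of documents that contain it."""
    doc_count = defaultdict(int)  # missing keys start at 0, so += 1 is safe
    for line in corpus:
        # set() drops repeated words, so each word counts once per document
        for word in set(line.split()):
            doc_count[word] += 1
    return dict(doc_count)


corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

doc_count = document_frequency(corpus)
print(doc_count)
```

With the example corpus this should give 4 for 'this', 'is', and 'the', 3 for 'document', and 2 for 'first'. The order of keys in the printed dictionary may vary between runs because it follows set iteration order.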