
I'm trying to implement a TF-IDF vectorizer without sklearn. For every word in the corpus (a list of strings, one string per document), I want to count the number of documents in which that word appears. Example:

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

Desired output: {'this': 4, 'is': 4}, and so on for every word

My code:

def docs(corpus):
    doc_count = dict()
    for line in corpus:
        for word in line.split():
            if word in line:
                doc_count[word] +=1
            else:
                doc_count[word] = 1
        print(counts)

docs(corpus)

Error I'm facing:

KeyError                                  Traceback (most recent call last)
<ipython-input-70-6bf2b69708bc> in <module>
      9         print(counts)
     10 
---> 11 docs(corpus)

<ipython-input-70-6bf2b69708bc> in docs(corpus)
      4         for word in line.split():
      5             if word in line.split():
----> 6                 doc_count[word] +=1
      7             else:
      8                 doc_count[word] = 1

KeyError: 'this'

Please let me know where I'm lacking and if I'm not iterating properly. Thank you!

Yash Vyas
  • Your logic makes no sense. You split the line and iterate over each word, but then you check whether that word exists in that same line? Pretty sure that is always going to be true. It literally comes from the line split. – Yoshikage Kira May 28 '21 at 06:59
  • You seem to have a typo. `if word in line` is trivially true because that's where it came from. – tripleee May 28 '21 at 06:59
  • Does this answer your question? [How to count one specific word in Python?](https://stackoverflow.com/questions/38401099/how-to-count-one-specific-word-in-python) – Yoshikage Kira May 28 '21 at 07:03
  • @Goion Thank you for pointing out the logic mistake, that is exactly where I was wrong. Also, the other link is not what I was looking for. – Yash Vyas May 28 '21 at 07:17
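
The point raised in these comments can be checked directly. The following is a minimal sketch, reusing one line of the corpus and the `doc_count` name from the question; the variable names `always_true`, `raised`, and `seen_before` are made up for illustration. It shows that the membership test against the line always succeeds, so the very first `+= 1` runs against an empty dictionary.

```python
line = 'this is the first document'
doc_count = {}

# Every word produced by split() is, by construction, found in the line,
# so this test can never send a word to the `else` branch.
always_true = all(word in line for word in line.split())

# The `+= 1` branch therefore runs for a key that does not exist yet.
try:
    doc_count['this'] += 1
    raised = False
except KeyError:
    raised = True

# Testing membership in the dictionary is what separates new words
# from words that have already been counted.
seen_before = 'this' in doc_count

print(always_true, raised, seen_before)
```

Checking `word in doc_count` instead of `word in line` is the change that removes the KeyError.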

1 Answer

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

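def docs(corpus):
    # Map each word to the number of documents that contain it.
    doc_count = dict()
    for line in corpus:
        # set() drops repeats, so a word counts at most once per document.
        for word in set(line.split()):
            # The mistake was here: test the dictionary, not the line.
            # `word in line` is always true, so a new word reached `+= 1`
            # before it had a key, which raised the KeyError.
            if word in doc_count:
                doc_count[word] += 1
            else:
                doc_count[word] = 1
    return doc_count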

ans=docs(corpus)
print(ans)
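
If each word should count at most once per document, as the question describes, one possible alternative is `collections.Counter` fed with the set of words of each line. This is only a sketch: the function name `document_frequency` is hypothetical, and splitting on whitespace is an assumption carried over from the question. Note that iterating over `line.split()` without deduplication would count 'document' twice for the second sentence.

```python
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def document_frequency(corpus):
    """Map each word to the number of documents that contain it."""
    counts = Counter()
    for line in corpus:
        # Counter.update adds 1 per element; passing a set means each
        # distinct word contributes once per document.
        counts.update(set(line.split()))
    return dict(counts)

df = document_frequency(corpus)
print(df['this'], df['document'], df['second'])
```

With this corpus, 'this' appears in all four documents, 'document' in three, and 'second' in one.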
Kiran