
I have a list "total_vocabulary" containing all the unique words in a collection of 56 documents, and a list of lists "rest_doc" containing the words of each document. I want to calculate the term frequency of each word from "total_vocabulary" in "rest_doc", so that "term_freq" is a list of lists with the same length as "total_vocabulary", where each entry is a list of 56 counts giving the number of occurrences of that word in each document. The problem is that the nested for loops take a long time, almost a minute to run. Is there any way to do this faster? Code:

term_freq = []  # one row per vocabulary word
for i in range(len(total_vocabulary)):
    doc = []  # counts of this word in each document
    for j in range(len(rest_doc)):
        counter = 0
        for k in range(len(rest_doc[j])):
            if total_vocabulary[i] == rest_doc[j][k]:
                counter = counter + 1
        doc.append(counter)
    term_freq.append(doc)

Here is my code.

1 Answer


You're iterating over the words in each document many times: once for each word in total_vocabulary.

It would be much faster to iterate over the words in each document just once. You can do that by rearranging the loops and by turning total_vocabulary into a set instead of a list, because a membership test on a set takes constant time on average, while a list has to be scanned element by element.

vocab_set = set(total_vocabulary)
counter = 0  # total number of vocabulary words seen across all documents
for document in rest_doc:
    for word in document:
        if word in vocab_set:
            counter = counter + 1
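If the counts need to stay separate per word and per document, as in the question's "term_freq", one possible sketch is to count each document once with collections.Counter and then look up every vocabulary word in those counts. The function name term_frequencies and the sample data are made up for illustration and are not from the question.

```python
from collections import Counter


def term_frequencies(total_vocabulary, rest_doc):
    """Return one row per vocabulary word and one column per document."""
    # Count every word of each document exactly once.
    doc_counts = [Counter(document) for document in rest_doc]
    # A Counter returns 0 for words that do not occur in a document,
    # so every row has exactly len(rest_doc) entries.
    return [[counts[word] for counts in doc_counts] for word in total_vocabulary]


# Hypothetical sample data standing in for the 56 real documents.
rest_doc = [["i", "believe", "you"], ["believe", "believe", "me"], ["no"]]
total_vocabulary = ["i", "believe", "you", "me", "no"]

term_freq = term_frequencies(total_vocabulary, rest_doc)
print(term_freq[1])  # counts of "believe" per document: [1, 2, 0]
```

This does one pass over each document to build its counts, followed by one dictionary lookup per (word, document) pair, instead of rescanning every document for every vocabulary word.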
John Gordon
  • My main problem is iterating through the list of lists of all documents. For example, if a word is "believe", it will have a different count for each document it appears in. I need the term frequency of every word in total_vocabulary across all the documents, with the total occurrences of that word in each document kept separate. This is not getting the job done, sorry. – unaizhaider Mar 31 '20 at 09:26
  • Your question said that the problem was speed, and I addressed that in my answer. If my answer is unsuitable due to unstated requirements on your part, that's on you. – John Gordon Mar 31 '20 at 14:53
  • Your answer doesn't calculate term frequency correctly. What will I do with speed if my output is incorrect? – unaizhaider Mar 31 '20 at 17:27