I got this tf-idf script from yebrahim, but every term in my output document comes out with a score of 0. Is there a problem with the code? Example of the output: hippo 0.0, hipper 0.0, hip 0.0, hint 0.0, hindsight 0.0, hill 0.0, hilarious 0.0

Thanks for the help.

    # increment local count
    for word in doc_words:
        if word in terms_in_doc:
            terms_in_doc[word] += 1
        else:
            terms_in_doc[word]  = 1

    # increment document frequency (number of documents containing each word)
    for (word,freq) in terms_in_doc.items():
        if word in global_term_freq:
            global_term_freq[word] += 1
        else:
            global_term_freq[word]  = 1

    global_terms_in_doc[f] = terms_in_doc

print('working through documents.. ')
for f in all_files:

    writer = open(f + '_final', 'w')
    result = []
    # iterate over terms in f, calculate their tf-idf, put in new list
    max_freq = 0
    for (term,freq) in global_terms_in_doc[f].items():
        if freq > max_freq:
            max_freq = freq
    for (term,freq) in global_terms_in_doc[f].items():
        idf = math.log(float(1 + num_docs) / float(1 + global_term_freq[term]))
        tfidf = float(freq) / float(max_freq) * float(idf)
        result.append([tfidf, term])

    # sort result on tfidf and write them in descending order
    result = sorted(result, reverse=True)
    for (tfidf, term) in result[:top_k]:
        if display_mode == 'both':
            writer.write(term + '\t' + str(tfidf) + '\n')
        else:
            writer.write(term + '\n')
    writer.close()  # flush and release the file before moving to the next document
mpenkov
  • You're going to have to isolate the part that's giving you problems. That's a lot of code to go through, and you appear to be using a 3rd party library to do the tokenizing. It would help if you mentioned/included that part as well. – Joel Cornett Apr 22 '13 at 04:45
  • Hi, sorry about that. I have just edited the question. Can you help me now? – user2106416 Apr 22 '13 at 05:00
  • What happens if you put `assert tfidf == 0.0, term` in your final `for` loop? – Joel Cornett Apr 22 '13 at 05:19
  • I get an error: invalid syntax. – user2106416 Apr 22 '13 at 05:32
  • The `SyntaxError` will also tell you the line number, and position at which the parser broke on syntax. That should help you properly place the above statement. – Joel Cornett Apr 22 '13 at 05:38
  • Since you appear to be having a problem with code that you didn't write yourself, it may be worth trying to get in touch with the author of the original code. – mpenkov Apr 22 '13 at 10:26

1 Answer

The output of tf-idf obviously depends on you counting the terms correctly. If you get this wrong, then the results will be unexpected. You may want to output the raw counts for each word to verify this. For example, how many times does the word "hippo" appear in the current document, and in the entire collection?
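
A minimal sketch of that check, assuming Python 3 and a toy two-document corpus (the file names and tokens below are hypothetical, not taken from your data). It prints the term frequency, document frequency, and idf for every term, using the same smoothed idf formula as the code in the question. One thing worth looking for in such a dump: with that formula, a term whose document frequency equals `num_docs` gets `log(1) = 0`, so if that holds for every term, every score will be 0.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the real tokenized files.
docs = {
    "a.txt": ["hip", "hippo", "hip"],
    "b.txt": ["hip", "hill"],
}

num_docs = len(docs)

# Raw term counts per document.
term_counts = {name: Counter(words) for name, words in docs.items()}

# Document frequency: in how many documents each term occurs.
doc_freq = Counter()
for counts in term_counts.values():
    doc_freq.update(counts.keys())


def idf(term):
    # Same smoothed formula as in the question.
    return math.log(float(1 + num_docs) / float(1 + doc_freq[term]))


# Dump the raw numbers so suspicious values are easy to spot.
for name in sorted(term_counts):
    for term in sorted(term_counts[name]):
        print(name, term,
              "tf:", term_counts[name][term],
              "df:", doc_freq[term],
              "idf:", round(idf(term), 4))
```

In this toy corpus "hip" occurs in both documents, so its idf is exactly 0, while "hippo" and "hill" get a positive value.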

Some other pointers:

  • Instead of wrapping every operand in `float()` for division, use `from __future__ import division` (needed in Python 2 only; in Python 3, `/` is already true division). It makes your code more readable.
  • Use `collections.defaultdict` for combining a dictionary with a counter. This avoids having to check whether a key is already present before incrementing it. If you dislike `defaultdict`, then use a `try`/`except KeyError` block, which is usually faster than the `if` check when most keys are already present.
  • In Python 2, don't iterate over the `items()` of a dictionary. It builds an entire new list of (key, value) pairs, which costs extra time and memory. Iterate over the keys of the dictionary (`for k in some_dictionary`) and use normal indexing to access the values (`some_dictionary[k]`), or use `iteritems()`. In Python 3, `items()` returns a lightweight view, so this concern does not apply.
  • You don't need a `for` loop to calculate the maximum of a list in Python; the built-in `max()` does it.
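
A short sketch of how the pointers above could look together, assuming Python 3. The token list is hypothetical; `terms_in_doc` and `max_freq` mirror the names in the question, while `normalized_tf` is a name introduced here purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical tokens for a single document.
doc_words = ["hip", "hippo", "hip", "hill"]

# defaultdict(int) supplies 0 for missing keys, so no membership check is needed.
terms_in_doc = defaultdict(int)
for word in doc_words:
    terms_in_doc[word] += 1

# collections.Counter does the same counting in a single call.
counts = Counter(doc_words)

# The built-in max() replaces the explicit loop that tracked max_freq.
max_freq = max(terms_in_doc.values())

# In Python 3, / is true division, so no float() wrappers are needed.
# Iterating over the keys and indexing, as suggested above.
normalized_tf = {term: terms_in_doc[term] / max_freq for term in terms_in_doc}

print(normalized_tf)
```

This does not change what the original code computes; it only shortens the counting and the maximum-frequency step.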

The above pointers may not solve your problem directly, but they will make your code easier to read and understand (for both you and people on SO), making it easier to locate and resolve problems.

mpenkov
  • The division by the max. frequency of any term in *d* is in fact the third alternative definition of tf listed just below the part you cited. – Fred Foo Apr 22 '13 at 10:31
  • Good catch. After reading through the code again, I realized the problems I pointed out weren't really problems at all. I've updated my answer. – mpenkov Apr 22 '13 at 11:02