
I want to write a script that uses dictionaries to compute tf-idf scores.

The idea is to have the script find all .txt files in a directory and its subdirectories using os.walk:

import fnmatch
import os

files = []
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*.txt'):
        files.append(os.path.join(root, filename))
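As an aside, the same file list can be collected with `pathlib` instead of `os.walk` plus `fnmatch` (a sketch assuming Python 3.4+; `find_txt_files` is a name I made up):

```python
from pathlib import Path

def find_txt_files(directory):
    """Recursively collect .txt files, like the os.walk/fnmatch loop above."""
    return [str(p) for p in Path(directory).rglob('*.txt')]
```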

It then uses the list to find all the words and count how many times each appears:

import re
from collections import Counter

def word_sort(filename3):
    with open(filename3) as f3:
        passage = f3.read()
    stop_words = "THE OF A TO AND IS IN YOU THAT IT THIS YOUR AS AN BUT FOR".split()
    words = re.findall(r'\w+', passage)
    cap_words = [word.upper() for word in words if word.upper() not in stop_words]
    return Counter(cap_words)

term_freq_per_file = {}
for file in files:
    term_freq_per_file[file] = word_sort(file)

It ends up with a dictionary like this:

 '/home/seb/Learning/ex15_sample.txt': Counter({'LOTS': 2, 'STUFF': 2, 'HAVE': 1,
                                     'I': 1, 'TYPED': 1, 'INTO': 1, 'HERE': 1,
                                      'FILE': 1, 'FUN': 1, 'COOL': 1,'REALLY': 1}),

In my mind this gives me the word frequency per file.

How do I go about finding the actual tf?

And how would I find idf?

By tf I mean the term frequency, i.e. how often a word (term) appears in a document:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

And by idf I mean the inverse document frequency, where document frequency is the number of documents a word appears in:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

To clarify, my question is how to extract those values and put them into the formulas. I know the values are there, but I don't know how to pull them out and use them further.
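For concreteness, this is roughly the shape of what I am trying to compute (a sketch; the helper names `tf` and `idf` are made up, and it assumes `term_freq_per_file` is the dict of Counters built above):

```python
import math
from collections import Counter

def tf(term, counter):
    """Term frequency: count of term divided by total terms in the document."""
    total = sum(counter.values())
    return counter[term] / total if total else 0.0

def idf(term, term_freq_per_file):
    """Inverse document frequency over all files."""
    n_docs = len(term_freq_per_file)
    df = sum(1 for counter in term_freq_per_file.values() if term in counter)
    return math.log(n_docs / df) if df else 0.0
```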


I have decided to make another dictionary that records which files each word has been used in, like this:

{word : (file1, file2, file3)}

by iterating through the first dictionary like this:

for file in tfDic:
     word = tfDic[file][Counter]
     for word in tfDic:
        if word not in dfDic.keys():
            dfDic.setdefault(word,[]).append(file)
        if word in dfDic.keys():
            dfDic[word].append(file)

The problem is with this line:

word = tfDic[file][Counter]

I thought it would 'navigate' to the word, but I have since noticed that the words are keys in the Counter dictionary, which is itself the value stored in tfDic under the file name.

My question is: how do I tell it to iterate through the words (the keys of the Counter dictionary)?
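To make the target concrete, the shape I am after is something like this (a sketch with made-up file names; note that iterating a Counter, like any dict, yields its keys):

```python
from collections import Counter

tfDic = {
    'file1.txt': Counter({'LOTS': 2, 'STUFF': 2}),
    'file2.txt': Counter({'STUFF': 1}),
}

dfDic = {}
for file, counter in tfDic.items():
    for word in counter:              # iterating a Counter yields its keys
        dfDic.setdefault(word, set()).add(file)
```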

Sebastian
  • You can make this clearer by explaining what you expect `tf` and `idf` to be, and what they mean to you... – Jon Clements Aug 27 '14 at 14:01
  • are they weighted by certain words? – Padraic Cunningham Aug 27 '14 at 14:03
  • You already have "Number of times term t appears in a document", "Total number of documents", and "Number of documents with term t in it", by looking at the dict. So is your question "How do I get the total number of terms in a document?"? – Kevin Aug 27 '14 at 14:15
  • Your definition of tf is wrong: tf is just the frequency of a term in a document. So you already have tf. idf is a matter of counting (a single loop will do this) and applying the formula. – Fred Foo Aug 27 '14 at 14:27
  • `tf * idf` is a product. The `i` in `idf` stands for "inverse" so it can also be expressed as a ratio `tf / df`. – tripleee Aug 27 '14 at 14:28
  • You appear to be following http://www.tfidf.com/ but did you click through to the Python implementation at http://code.google.com/p/tfidf/source/browse/trunk/tfidf.py as well? – tripleee Aug 27 '14 at 14:39
  • One of the features of tf-idf is that you should not need to maintain a stop-word list -- the words with a high Document Frequency will be naturally given a high divisor by the algorithm. – tripleee Aug 27 '14 at 14:43
  • You've not correctly defined the tf and idf. Have a look to this formulation in order to improve your algorithm: http://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html – Alessandro Suglia Aug 29 '14 at 10:39
  • @tripleee I'm aware of the stop words, it is just a part of code I have used for something else and have yet not cleaned it up :) – Sebastian Aug 29 '14 at 10:53

3 Answers


If you want to stick with your current data structure, you have to scan the entire structure, file by file, for each word whose idf you want to calculate.

import math

# assume the term you are looking for is in the variable term,
# and that it occurs in at least one file (otherwise df stays 0)
df = 0
for file in files:
    if term in term_freq_per_file[file]:
        df += 1
idf = math.log(len(files) / df)
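If you need the idf of many terms, you could also count document frequencies for every word in a single pass (a sketch, not part of the original answer; `all_idfs` is a made-up helper name):

```python
import math
from collections import Counter

def all_idfs(term_freq_per_file):
    """Compute idf for every term by tallying document frequencies in one pass."""
    df = Counter()
    for counter in term_freq_per_file.values():
        df.update(counter.keys())          # each file counts once per term
    n_docs = len(term_freq_per_file)
    return {term: math.log(n_docs / count) for term, count in df.items()}
```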

An earlier version of this answer contained a sketch for an alternative data structure, but this is probably good enough.

tripleee
  • I have replaced my answer with a completely different one. Please refresh. – tripleee Aug 28 '14 at 10:53
  • You might want to remove the now-obsolete comment, like I have done with mine. (Click the little grey X on the right which is visible when you hover over it.) – tripleee Aug 28 '14 at 10:54
  • Thank you, how do I know what to put after "if"? I get an error saying name 'term' is not defined. This is one of the things that confuses me. Do I need to change my function so it 'reacts' to "term" or "word"? – Sebastian Aug 28 '14 at 11:24
  • As the comment explains already, the variable should contain the word you want to calculate the tf*idf value for. – tripleee Aug 28 '14 at 13:13
  • I have updated and edited my question a bit, could you have a look at it please? – Sebastian Aug 29 '14 at 09:55
  • My suggestion would be to roll back your edit (there's a button for that in the [edit history](http://stackoverflow.com/posts/25529141/revisions)) and post the new content as a separate, new question instead. Include a link to this question. Consider accepting my answer, or posting one of your own and accepting that so that this question no longer shows up as unresolved. – tripleee Aug 29 '14 at 10:20
  • I have just solved the problem now, will post it and end this question, thanks a lot for all of your help and time – Sebastian Aug 29 '14 at 10:32

(finally)

I decided to go back and change my word count formula, so that instead of:

word_sort = Counter(cap_words)

I iterate through the words in the list and build my own dictionary of how many times each appears:

word_sort = {}
for term in cap_words:
    word_sort[term] = cap_words.count(term)  # note: count() rescans the whole list for every term

So instead of having a Counter sub-dictionary every time, I end up with this for tfDic:

'/home/seb/Learning/ex17output.txt': {'COOL': 1,
                                   'FILE': 1,
                                   'FUN': 1,
                                   'HAVE': 1,
                                   'HERE': 1,
                                   'I': 1,
                                   'INTO': 1,
                                   'LOTS': 2,
                                   'REALLY': 1,
                                   'STUFF': 2,
                                   'TYPED': 1},

Then I iterate through the keys of tfDic[file] to create another dictionary that records which files a given word has been used in:

dfDic = {}
for file in tfDic:
    for word in tfDic[file]:              # iterating tfDic[file] yields its keys
        dfDic.setdefault(word, []).append(file)

and the final result is as such:

 'HERE': ['/home/seb/Learning/ex15_sample.txt',
          '/home/seb/Learning/ex17output.txt'],

Now I plan on just 'extracting' the values and putting them into the formula I mentioned before.
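That last step might look like this (a sketch with made-up sample data, assuming tfDic and dfDic shaped as above; `set()` guards against a file being listed more than once):

```python
import math

tfDic = {'f1.txt': {'LOTS': 2, 'STUFF': 2}, 'f2.txt': {'STUFF': 1}}
dfDic = {'LOTS': ['f1.txt'], 'STUFF': ['f1.txt', 'f2.txt']}
n_docs = len(tfDic)

tfidf = {}
for file, counts in tfDic.items():
    total_terms = sum(counts.values())
    tfidf[file] = {
        word: (count / total_terms) * math.log(n_docs / len(set(dfDic[word])))
        for word, count in counts.items()
    }
```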

Sebastian
  • A `Counter` is just a subclass of `dict` so it has the same methods. I agree that having `Counter` in the output is a bit misleading; for your purposes, it really is just a dict, and you should ignore the `Counter` identifier. – tripleee Aug 29 '14 at 11:20

Unless this is a learning exercise on how tf-idf works, I'd recommend using the built-in scikit-learn classes to do this.

First, create a list of the count dictionaries, one per file. Then feed that list to DictVectorizer, and feed the resulting sparse matrix to TfidfTransformer.

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

dv = DictVectorizer()
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = dv.fit_transform(D)

tv = TfidfTransformer()
tfidf = tv.fit_transform(X)
print(tfidf.toarray())

Ryan Ginstrom