I want to write a script that uses dictionaries to compute tf-idf (a ratio?).
The idea is to have the script find all .txt files in a directory and its subdirectories by using os.walk:
import fnmatch
import os

# directory is the top-level folder to search
files = []
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*.txt'):
        files.append(os.path.join(root, filename))
It then uses that list to find all the words in each file and count how many times they appear:
import re
from collections import Counter

def word_sort(filename3):
    with open(filename3) as f3:
        passage = f3.read()
    stop_words = "THE OF A TO AND IS IN YOU THAT IT THIS YOUR AS AN BUT FOR".split()
    words = re.findall(r'\w+', passage)
    cap_words = [word.upper() for word in words if word.upper() not in stop_words]
    # Count how often each remaining (non-stop) word occurs in this file
    return Counter(cap_words)

term_freq_per_file = {}
for file in files:
    term_freq_per_file[file] = word_sort(file)
It ends up with a dictionary like this:
'/home/seb/Learning/ex15_sample.txt': Counter({'LOTS': 2, 'STUFF': 2, 'HAVE': 1,
    'I': 1, 'TYPED': 1, 'INTO': 1, 'HERE': 1,
    'FILE': 1, 'FUN': 1, 'COOL': 1, 'REALLY': 1}),
In my mind this gives me the word frequency per file.
How do I go about finding the actual tf?
And how would I find idf?
By tf I mean the term frequency: how many times a word (term) appears in a document, relative to the length of that document:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
By idf I mean the inverse document frequency, where the document frequency is the number of documents in which the word appears:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
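For reference, here is a minimal sketch of how the two formulas above might be applied to a dictionary shaped like term_freq_per_file. The file names and counts are made-up sample data, and the helper names tf and idf are hypothetical. It assumes Python 3 division and that "total number of terms" means the words left after stop-word removal, since that is all the Counter holds.

```python
import math
from collections import Counter

# Hypothetical sample data in the same shape as term_freq_per_file.
term_freq_per_file = {
    "a.txt": Counter({"LOTS": 2, "STUFF": 2, "FUN": 1}),
    "b.txt": Counter({"STUFF": 1, "COOL": 3}),
}

def tf(term, counts):
    """Occurrences of term divided by the total number of counted terms."""
    total = sum(counts.values())
    return counts[term] / total if total else 0.0

def idf(term, counts_per_file):
    """Natural log of (number of documents / documents containing term)."""
    containing = sum(1 for counts in counts_per_file.values() if term in counts)
    if containing == 0:
        return 0.0  # assumption: treat a term seen nowhere as carrying no weight
    return math.log(len(counts_per_file) / containing)

# tf-idf for every (file, word) pair
tf_idf = {
    name: {word: tf(word, counts) * idf(word, term_freq_per_file) for word in counts}
    for name, counts in term_freq_per_file.items()
}
```

A word that occurs in every file, such as STUFF in this sample, gets an idf of log(1) = 0, so its tf-idf is 0 as well.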
To clarify, my question is how to extract those values and plug them into the formulas. I know they are in the dictionary, but I don't know how to pull them out and use them further.
I have decided to make another dictionary that records which files each word has been used in, like this:
{word : (file1, file2, file3)}
by iterating through the first dictionary like this:
for file in tfDic:
    word = tfDic[file][Counter]
    for word in tfDic:
        if word not in dfDic.keys():
            dfDic.setdefault(word,[]).append(file)
        if word in dfDic.keys():
            dfDic[word].append(file)
The problem is with this line:
    word = tfDic[file][Counter]
I thought it would 'navigate' to the word, but I have noticed that the words are keys in the Counter dictionary, which is itself the value stored in tfDic for each file.
My question is: how do I tell it to iterate through the words (the keys of the Counter dictionary)?
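To make the intended result concrete, here is a rough sketch of one way the loop could be written. A Counter is a dict subclass, so iterating over it yields its keys (the words). The sample tfDic contents are invented for illustration, and using defaultdict in place of the setdefault/if checks is a swapped-in choice, not something from the code above.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: file name -> Counter of words in that file.
tfDic = {
    "file1.txt": Counter({"LOTS": 2, "STUFF": 2}),
    "file2.txt": Counter({"STUFF": 1, "COOL": 1}),
}

# word -> list of files that contain it
dfDic = defaultdict(list)
for file, counts in tfDic.items():
    # Iterating over a Counter yields its keys, i.e. the words
    for word in counts:
        dfDic[word].append(file)
```

With this shape, len(dfDic[word]) is the number of documents containing the word, which is the denominator the IDF formula needs.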