I have several text files that I want to compare against a vocabulary list consisting of multi-word expressions and single words. The desired output is a dictionary with every element of that list as a key and its frequency in the text file as the value. To construct the vocabulary list I need to merge two lists:

list1 = ['accounting',..., 'yields', 'zero-bond']
list2 = ['accounting', 'actual cost', ..., 'zero-bond']
vocabulary_list = ['accounting', 'actual cost', ..., 'yields', 'zero-bond']

sample_text = "Accounting experts predict an increase in yields for zero-bond and yields for junk-bonds."

desired_output = {'accounting': 1, 'actual cost': 0, ..., 'yields': 2, 'zero-bond': 1}

What I tried:

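from collections import Counter

def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj"""
    # initialise the counter to 0 for each word
    ct = Counter(dict((w, 0) for w in words))
    # split each line on whitespace; iterating over a line directly yields characters
    file_words = (word for line in fileobj for word in line.split())
    filtered_words = (word for word in file_words if word in words)
    # add the observed counts on top of the zero-initialised entries
    ct.update(filtered_words)
    return ct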

def print_summary(filepath, ct):
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    # drop the 4-character extension and write to <name>_dict.txt
    with open(filepath[:-4] + '_dict' + '.txt', mode='w') as outfile:
        outfile.write('{0}\n{1}\n{2}\n\n'.format(filepath, ', '.join(words), ', '.join(counts)))
    return outfile

Is there any way to do this in Python? I figured out how to handle a vocabulary list of single words (one token each), but couldn't find a solution for the multi-word case.

Padraic Cunningham
Dominik Scheld
  • What was your single-word solution? In what way(s) did it not work for expressions? – Scott Hunter Feb 04 '15 at 15:32
  • def word_frequency(fileobj, words): """Build a Counter of specified words in fileobj""" # initialise the counter to 0 for each word ct = Counter(dict((w, 0) for w in words)) file_words = (word for line in fileobj for word in line) filtered_words = (word for word in file_words if word in words) return Counter(filtered_words) – Dominik Scheld Feb 04 '15 at 15:43
  • def print_summary(filepath, ct): words = sorted(ct.keys()) counts = [str(ct[k]) for k in words] with open(filepath[:-4] + '_dict' + '.txt', mode = 'w') as outfile: outfile.write('{0}\n{1}\n{2}\n\n'.format(filepath,', '.join(words),', '.join(counts))) return outfile – Dominik Scheld Feb 04 '15 at 15:44
  • words = vocabulary_list – Dominik Scheld Feb 04 '15 at 15:48
  • Unfortunately the first function only captures single tokens, so it can only compare single-token words against the vocabulary list – Dominik Scheld Feb 04 '15 at 15:51

1 Answer

If you also want to count words that end with punctuation, you will need to clean the text as well, i.e. so that 'yields' and 'yields!' are treated as the same word.

import re
from collections import Counter

vocabulary_list = ['accounting', 'actual cost', 'yields', 'zero-bond']
d = {k: 0 for k in vocabulary_list}
sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
# keep a list, not a set: a set would collapse repeated words to a count of 1
splitted = sample_text.split()
c = Counter(splitted)  # get count of all whitespace-separated words

for k in d:
    # a multi-word expression never appears among the split tokens,
    # so search the whole text for it instead
    if len(k.split()) > 1:
        # re.escape guards against regex metacharacters in the expression
        check = re.findall(r'\b{0}\b'.format(re.escape(k)), sample_text)
        d[k] += len(check)
    # else we are looking for a single word; a Counter returns 0 for missing keys
    else:
        d[k] += c[k]
print(d)
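
As a rough sketch of the punctuation point above: instead of splitting on whitespace, you can match every vocabulary entry (single words and expressions alike) with a regex that requires a non-word character, or the start/end of the text, on both sides. The function name `count_vocabulary` and the sample sentence are made up for illustration, and this is only one possible way to do the cleaning.

```python
import re

def count_vocabulary(text, vocabulary):
    """Count case-insensitive whole-word matches of each vocabulary entry."""
    counts = {}
    for entry in vocabulary:
        # (?<!\w) and (?!\w) require that no word character touches the match,
        # so trailing punctuation such as '!' or ',' does not block it
        pattern = r'(?<!\w){0}(?!\w)'.format(re.escape(entry))
        counts[entry] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# hypothetical sample sentence, not taken from the question
sample_text = "Yields rose. Analysts like yields! The actual cost, however, is unclear."
print(count_vocabulary(sample_text, ['yields', 'actual cost', 'zero-bond']))
```

Because the pattern only looks at the characters directly next to the match, 'yields' still does not match inside a longer word, and 'actual cost' does not match 'actual costs'.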

To chain all the lists into a single vocab dict:

from collections import Counter
from itertools import chain
import re

l1, l2 = ['accounting', 'actual cost'], ['yields', 'zero-bond']
# dict keys are unique, so entries present in both lists collapse automatically
vocabulary_dict = {k: 0 for k in chain(l1, l2)}
print(vocabulary_dict)
sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
splitted = sample_text.split()
c = Counter(splitted)  # count every whitespace-separated word once, up front

for k in vocabulary_dict:
    spl = k.split()
    ln = len(spl)
    if ln > 1:
        # multi-word expression: search the full text; re.escape protects metacharacters
        check = re.findall(r'\b{0}\b'.format(re.escape(k)), sample_text)
        if check:
            vocabulary_dict[k] += len(check)
    elif k in c:
        vocabulary_dict[k] += c[k]
print(vocabulary_dict)
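
If you need the merged vocabulary as a unique, sorted list rather than a dict, a set union is one simple option. A minimal sketch, assuming short stand-in lists (the real list1 and list2 in the question are longer and partly elided):

```python
# stand-in lists for illustration only; the question's lists are longer
list1 = ['accounting', 'yields', 'zero-bond']
list2 = ['accounting', 'actual cost', 'zero-bond']

# the set union drops duplicates, sorted() restores alphabetical order
vocabulary_list = sorted(set(list1) | set(list2))
print(vocabulary_list)
```

Note that sorting is plain string order, so it is case-sensitive unless you lowercase the entries first.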

You could also create two dicts, one for phrases and one for single words, and do a pass over each.
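
A minimal sketch of that two-dict idea, assuming the vocabulary is split on whether an entry contains whitespace. The function name `count_words_and_phrases` and the token pattern are my own choices for illustration, not the only way to do it.

```python
import re
from collections import Counter

def count_words_and_phrases(text, vocabulary):
    """Count single words via a token Counter and phrases via regex search."""
    text = text.lower()
    single_words = [v for v in vocabulary if len(v.split()) == 1]
    phrases = [v for v in vocabulary if len(v.split()) > 1]

    # tokens are runs of word characters, optionally joined by hyphens,
    # so 'zero-bond' stays one token and trailing punctuation is dropped
    tokens = Counter(re.findall(r'\w+(?:-\w+)*', text))
    word_counts = {w: tokens[w] for w in single_words}

    # phrases span several tokens, so search the full text for them
    phrase_counts = {
        p: len(re.findall(r'(?<!\w){0}(?!\w)'.format(re.escape(p)), text))
        for p in phrases
    }

    # combine the two dicts at the end
    combined = dict(word_counts)
    combined.update(phrase_counts)
    return combined

sample_text = "Accounting experts predict an increase in yields for zero-bond and yields for junk-bonds."
print(count_words_and_phrases(sample_text, ['accounting', 'actual cost', 'yields', 'zero-bond']))
```

This removes the need to split each key and check its length inside the counting loop, since the split into words and phrases happens once up front.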

Padraic Cunningham
  • Nice solution Padraic, but this doesn't work for a sample like this: sample_text = "accounting experts ...actual cost ... predict an increase in yields for zero-bond and yields" -> ('actual cost':0 , 'accounting':1 ... ) – Dominik Scheld Feb 04 '15 at 15:40
  • Thanks a lot Padraic :) One little thing is missing: the output of your script is (...'yields':1), but it should be (...'yields':2) instead? – Dominik Scheld Feb 04 '15 at 15:57
  • @DominikScheld, yes, I need to reverse the logic, one sec – Padraic Cunningham Feb 04 '15 at 15:58
  • Now it works perfectly. Do you also have an idea how to combine the two lists in order to construct a unique vocabulary_list? – Dominik Scheld Feb 04 '15 at 16:05
  • @DominikScheld, added an example of how to chain the lists and create the dict. You could also have two dicts, one for phrases and one for single words, and just combine them at the end, removing the need to split and check the len – Padraic Cunningham Feb 04 '15 at 16:11