3

First time posting on Stack Overflow - previous questions have always been enough to solve my problems! My main problem is the logic... even a pseudocode answer would be great.

I'm using Python to read data from each line of a text file, in the format:

This is a tweet captured from the twitter api #hashtag http://url.com/site

Using NLTK, I can tokenize by line and then use reader.sents() to iterate through the lines:

reader = TaggedCorpusReader(filecorpus, r'.*\.txt', sent_tokenizer=LineTokenizer())

reader.sents()[:10]

But I would like to count the frequency of certain 'hot words' (stored in a list or similar) per line, then write the counts back to a text file. If I used reader.words(), I could count the frequency of 'hot words' across the entire text, but I'm looking for the count per line (or 'sentence' in this case).

Ideally, something like:

hotwords = ['tweet', 'twitter']

for each line:
    tokenize line into words
    for each word in line:
        if word equals hotwords[0], increment the count for hotwords[0]
        if word equals hotwords[1], increment the count for hotwords[1]
    at end of line, for each hotword:
        write its count to the output file

Also, I'm not worried about the URL being broken up (using WordPunctTokenizer would strip the punctuation - that's not an issue).

Any useful pointers (including pseudo or links to other similar code) would be great.

---- EDIT ------------------

Ended up doing something like this:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tokenize import LineTokenizer
#from nltk.tokenize import WordPunctTokenizer
from collections import defaultdict

# Create reader and generate corpus from all csv files in dir.
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus'
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer())
print "Reader accessible." 
print filereader.fileids()

#define hotwords
hotwords = ('cool','foo','bar')

tweetdict = []  # one dict of hotword counts per line/tweet

for line in filereader.sents():
    wordcounts = defaultdict(int)
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)

Output is:

print tweetdict

[defaultdict(<type 'int'>, {}),
 defaultdict(<type 'int'>, {'foo': 2, 'bar': 1, 'cool': 2}),
 defaultdict(<type 'int'>, {'cool': 1})]
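
The original goal also included writing the counts back to a text file; a minimal sketch on top of tweetdict (the output filename is my own placeholder):

# Write one tab-separated row of hotword counts per tweet/line.
# 'hotword_counts.txt' is a placeholder path.
with open('hotword_counts.txt', 'w') as out:
    out.write('\t'.join(hotwords) + '\n')
    for counts in tweetdict:
        # defaultdict(int) returns 0 for hotwords missing from this line
        out.write('\t'.join(str(counts[word]) for word in hotwords) + '\n')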
bhalsall
3 Answers

4
from collections import Counter

hotwords = ('tweet', 'twitter')

lines = "a b c tweet d e f\ng h i j k   twitter\n\na"

c = Counter(lines.split())

for hotword in hotwords:
    print hotword, c[hotword]

This script works in Python 2.7+, where collections.Counter was introduced.
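
Since the question asks for counts per line, one possible per-line adaptation (my sketch, not part of the original answer):

# Count hotwords separately for each newline-delimited tweet.
for line in lines.split('\n'):
    c = Counter(line.split())
    for hotword in hotwords:
        print hotword, c[hotword]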

razpeitia
  • Also, you can use `most_common`, e.g. `c.most_common(10)`, to get the 10 most common words in the counter. – razpeitia Apr 08 '11 at 13:48
  • I was going to suggest using a dictionary {String word:int count} like @Daniel Roseman, but this looks much sleeker. – Tom Apr 08 '11 at 14:52
1

defaultdict is your friend for this sort of thing.

from collections import defaultdict
for line in myfile:
    words = line.split()  # simple whitespace tokenization
    word_counts = defaultdict(int)
    for word in words:
        if word in hotwords:
            word_counts[word] += 1
    print '\n'.join('%s: %s' % (k, v) for k, v in word_counts.items())
Daniel Roseman
  • Yep - just tweaked this a little, but the logic is great - I preferred this over the Counter solution. Is creating a new defaultdict for each line in the text file the most efficient approach? – bhalsall Apr 08 '11 at 15:35
  • @bhalsall: you could call `word_counts.clear()` after each line instead of creating a new defaultdict each time. – jfs Apr 09 '11 at 10:13
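
A sketch of the clear() suggestion from the comment above (assuming myfile and hotwords as in the answer; if you want to keep per-line results, copy the dict before clearing):

from collections import defaultdict

word_counts = defaultdict(int)
for line in myfile:
    word_counts.clear()  # reuse one dict rather than allocating per line
    for word in line.split():
        if word in hotwords:
            word_counts[word] += 1
    print '\n'.join('%s: %s' % (k, v) for k, v in word_counts.items())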
0

Do you need to tokenize it? You can use count() on each line for each of your words.

# Map each hotword to a list of its per-line counts.
hotwords = {'tweet': [], 'twitter': []}
for line in file_obj:
    for word in hotwords.keys():
        hotwords[word].append(line.count(word))
nmichaels
  • You'll end up counting substrings this way. If a hotword were 'sex', I wouldn't want 'Middlesex' being counted. – Steve Mayne Apr 08 '11 at 13:27
  • That's the right kind of thing, though. Ideally I need to 're-tokenize' each line into words. I can't just tokenize into words from the start, because then I don't preserve the newline delimiter (which is what separates each tweet)... and I end up counting word frequencies for the entire text file, not per line. – bhalsall Apr 08 '11 at 13:33
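
Regarding the substring issue raised in the comments: one way to keep the simplicity of per-line counting without matching inside longer words is a regex with word boundaries (my sketch, not from the answer; file_obj and hotwords as above):

import re

hotwords = ('tweet', 'twitter')
for line in file_obj:
    for word in hotwords:
        # \b word boundaries stop e.g. 'sex' matching inside 'Middlesex'
        print word, len(re.findall(r'\b%s\b' % re.escape(word), line))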