I need to save all tweets from the Twitter Streaming API to a database in real time, filtered by a certain list of words. I've achieved this using tweetstream, defining the list words like this before calling FilterStream():

words = ["word1","two words","anotherWord"]

What I'd like to do is add, change, or remove any of those values without stopping the script. To do so, I created a plain text file containing the words I want to filter by, one per line. With this code I get the list words just fine:

file = open('words.txt','r')
words = file.read().split("\n")

Those lines work when the script starts, but I need the list to be re-read every time the script checks the stream. Any ideas?

2 Answers

Perhaps something like this will work:

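import time
from itertools import ifilter  # Python 2; on Python 3 use the built-in filter

import tweetstream

def rebuild_wordlist():
    # Re-read the word list, skipping blank lines.
    with open('words.txt', 'r') as f:
        return set(line for line in f.read().splitlines() if line)

def match(tweet):
    # Assumes each tweet is a dict whose 'text' key holds the message body.
    return any(w in tweet.get('text', '') for w in words)

words, timestamp = rebuild_wordlist(), time.time()
stream = tweetstream.SampleStream("username", "password")
fstream = ifilter(match, stream)

for tweet in fstream:
    do_some_with_tweet(tweet)
    if time.time() > timestamp + 5.0:
        # refresh the wordlist every 5 seconds
        words, timestamp = rebuild_wordlist(), time.time()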

The words set is a global that is re-read at most once every 5 seconds while the filter is running; the check happens each time a matching tweet is processed.
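
The same refresh idea can be tried without a live stream. The following is only a minimal sketch under assumptions: the names load_words and filter_with_reload are hypothetical, the stream is any iterable of plain strings rather than tweetstream objects, and the clock parameter exists only to make the timing testable. Unlike the loop above, it checks the reload timer for every incoming item, not just for matches.

```python
import time

def load_words(path):
    """Return the non-empty, stripped lines of `path` as a set."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def filter_with_reload(stream, path, interval=5.0, clock=time.time):
    """Yield the items of `stream` that contain any word listed in `path`.

    The word list is re-read once at least `interval` seconds have passed
    since the previous load; the timer is checked for every incoming item.
    """
    words, loaded_at = load_words(path), clock()
    for text in stream:
        if clock() - loaded_at >= interval:
            words, loaded_at = load_words(path), clock()
        if any(w in text for w in words):
            yield text
```

With real tweets, the text would first have to be pulled out of each tweet object before the substring test.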

Raymond Hettinger
  • Sorry, I answered after you edited the message. Let me try it out :) –  Nov 07 '11 at 01:57
  • OK, I couldn't get that working, but I think I should be using *FilterStream* instead of *SampleStream*, because the latter is supposed to deliver only *a sample* of all tweets and I'd like to get them all. That way I could also save some traffic, I guess (SampleStream would fetch many tweets, filter them, and save only the matching ones, whereas FilterStream would save every tweet it fetches). –  Nov 07 '11 at 02:21

You could read the updated word list in one thread and process tweets in another, using a Queue for communication.

Example:

Thread that reads tweets:

from Queue import Empty  # Python 2; the module is named queue on Python 3

import tweetstream

def read_tweets(q):
    words = q.get()  # block until the first word list arrives
    while True:
        with tweetstream.FilterStream(..track=words,..) as stream:
            for tweet in stream:  # NOTE: it requires special handling if it blocks
                process(tweet)
                try:
                    words = q.get_nowait()  # try to read a new word list
                except Empty:
                    pass
                else:
                    break  # new words: start a new connection

Thread that reads words:

import time

def read_words(q):
    words = None
    while True:
        with open('words.txt') as f:
            newwords = f.read().splitlines()
        if words != newwords:
            q.put(newwords)  # blocks while the previous list is still unread
            words = newwords
        time.sleep(1)  # poll the file once per second

The main script could look like:

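from Queue import Queue  # Python 2; the module is named queue on Python 3
from threading import Thread

q = Queue(1)  # holds at most one pending word list
t = Thread(target=read_tweets, args=(q,))
t.daemon = True  # do not keep the process alive for this thread
t.start()
read_words(q)  # runs forever in the main thread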

Instead of polling, you could use inotify or a similar mechanism to monitor changes to the 'words.txt' file.
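
inotify itself needs a platform-specific binding, so it is not shown here. As a portable middle ground, still polling but without re-reading the file each time, one could compare the file's modification time and size. This is only a sketch under assumptions: FileWatcher is a hypothetical helper name, and it relies on every edit changing either the mtime or the size as reported by os.stat.

```python
import os

class FileWatcher(object):
    """Report whether a file's (mtime, size) pair changed since the last check."""

    def __init__(self, path):
        self.path = path
        self._stamp = None  # nothing observed yet, so the first check reports a change

    def changed(self):
        st = os.stat(self.path)
        stamp = (st.st_mtime, st.st_size)
        if stamp != self._stamp:
            self._stamp = stamp
            return True
        return False
```

In read_words, the file would then be opened and parsed only when changed() returns True.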

jfs
  • That's amazing! It works great :D I might add a time-based function so I don't get banned for closing and opening connections too quickly. Thank you, really! –  Nov 07 '11 at 06:49
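
A time-based guard of the kind mentioned in this comment could be sketched as follows. The Throttle class and its parameters are hypothetical names introduced for illustration, and the minimum interval is a placeholder to choose yourself, not a documented Twitter limit. The clock and sleep arguments are injectable only so the behaviour can be checked without waiting.

```python
import time

class Throttle(object):
    """Keep at least `min_interval` seconds between successive wait() returns."""

    def __init__(self, min_interval, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous wait() returned

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous call: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() at the top of the `while True:` loop in read_tweets, just before the stream is opened, would space out reconnections.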