
I am using Python to store tweets related to a specific keyword (more precisely, only their date, user, bio and text) in a CSV file. As I am using Twitter's free API, I am limited to 450 tweets every 15 minutes, so I have written something that is supposed to store exactly 450 tweets in 15 minutes.

The problem is that something goes wrong when extracting the tweets: from a certain point on, the same tweet is stored again and again.

Any help would be much appreciated. Thanks in advance!

import time
from twython import Twython, TwythonError, TwythonStreamer
twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET) 

sfile = "tweets_" + keyword + todays_date + ".csv"
id_list = [last_id]  
count = 0
while count < 3*60*60*2: #we set the loop to run for 3hours

    # tweet extract method with the last list item as the max_id
    print("new crawl, max_id:", id_list[-1])
    tweets = twitter.search(q=keyword, count=2, max_id=id_list[-1])["statuses"]
    time.sleep(2) ## 2 seconds rest between api calls (450 allowed within 15min window)

    for status in tweets:
        id_list.append(status["id"]) ## append tweet id's

        if status==tweets[0]:
            continue

        if status==tweets[1]:
            date = status["created_at"].encode('utf-8')
            user = status["user"]["screen_name"].encode('utf-8') 
            bio = status["user"]["description"].encode('utf-8')
            text = status["text"].encode('utf-8')

            with open(sfile,'a') as sf:
                sf.write(str(status["id"])+ "|||" + str(date) + "|||" + str(user) + "|||" + str(bio) + "|||" + str(text)  +  "\n")

        count += 1
        print(count)
        print(date, text)
Martin Evans
  • I would recommend you stick to a standard comma delimiter for your CSV file. If your tweet contains a comma then the field is normally enclosed with quotes. It is also able to cope with newlines. Python's CSV library will handle all of this for you automatically. – Martin Evans Mar 08 '19 at 08:55

1 Answer


You should use Python's CSV library to write your CSV files. It takes a list containing all of the items for a row and adds the delimiters for you. If a value contains a comma, it encloses the value in quotes (which is how CSV files are meant to work). It can even handle newlines inside a value. If you open the resulting file in a spreadsheet application, you will see that it is read in correctly.
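
As a minimal sketch of that quoting behaviour, the following writes one row to an in-memory buffer and reads it back. The row values are made up for illustration, and `io.StringIO` merely stands in for a real file opened with `newline=''`:

```python
import csv
import io

# Hypothetical row: the bio contains a comma, the text contains
# a comma, double quotes and a newline.
row = ["1111", "01.01.2019", "Fred", "I am Fred, a tester",
       'Hello, this is a "tweet"\nwith a newline.']

# In-memory stand-in for open('tweets.csv', 'w', newline='')
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)

# Read the data back with csv.reader to check that nothing was lost
buffer.seek(0)
rows_read = list(csv.reader(buffer))

print(buffer.getvalue())
print(rows_read == [row])
```

The fields that need it are quoted on the way out, and `csv.reader` restores them on the way back in, so the embedded newline does not split the row in two.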

Rather than relying on time.sleep() alone to measure how long the script has been running, a better approach is to work with absolute times. The idea is to take your starting time and add three hours to it. You can then keep looping until this finish_time is reached.

The same approach can be applied to your API call allocation. Keep a counter holding how many calls you have left and decrement it on every call. If it reaches 0, stop making calls until the next fifteen-minute slot begins.

timedelta() can be used to add minutes or hours to an existing datetime object. By doing it this way, your times will never slip out of sync.

The following shows a simulation of how you can make things work. You just need to add back your code to get your Tweets:

from datetime import datetime, timedelta
import time
import csv
import random   # just for simulating a random ID

fifteen = timedelta(minutes=15)
finish_time = datetime.now() + timedelta(hours=3)

calls_allowed = 450
calls_remaining = calls_allowed

now = datetime.now()
next_allocation = now + fifteen

todays_date = now.strftime("%d_%m_%Y")
ids_seen = set()

with open(f'tweets_{todays_date}.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    while now < finish_time:
        time.sleep(2)
        now = datetime.now()

        if now >= next_allocation:
            next_allocation += fifteen
            calls_remaining = calls_allowed
            print("New call allocation")

        if calls_remaining:
            calls_remaining -= 1
            print(f"Get tweets - {calls_remaining} calls remaining")

            # Simulate a tweet response
            tweet_id = random.choice(["1111", "2222", "3333", "4444"])    # pick a random ID
            date = "01.01.2019"
            user = "Fred"
            bio = "I am Fred"
            text = "Hello, this is a tweet\nusing a comma and a newline."

            # Only write tweets that have not been written before
            if tweet_id not in ids_seen:
                csv_output.writerow([tweet_id, date, user, bio, text])
                ids_seen.add(tweet_id)

As for the problem of writing the same tweets repeatedly: you could use a set() to hold all of the IDs that you have already written, and then test whether a new tweet has been seen before writing it again.
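
The set can be combined with the max_id paging from the question. The sketch below is only an illustration under stated assumptions: `fake_search` is a hypothetical stand-in for `twitter.search(...)["statuses"]`, the IDs are invented, and it assumes max_id is inclusive (the tweet with that ID is returned again), which is why the next call uses the oldest ID minus one:

```python
def fake_search(max_id, count=2):
    """Hypothetical stand-in for the real search call.

    Returns up to `count` statuses, newest first, whose ID is
    less than or equal to max_id (max_id assumed inclusive).
    """
    all_ids = [109, 107, 105, 103, 101]  # invented tweet IDs
    matching = [i for i in all_ids if i <= max_id]
    return [{"id": i, "text": f"tweet {i}"} for i in matching[:count]]


def crawl(start_id, max_calls):
    """Page backwards through the results, recording each tweet ID once."""
    ids_seen = set()
    written = []
    max_id = start_id

    for _ in range(max_calls):
        statuses = fake_search(max_id)
        if not statuses:
            break  # nothing older is left to fetch

        for status in statuses:
            if status["id"] not in ids_seen:
                ids_seen.add(status["id"])
                written.append(status["id"])  # a real script would write the CSV row here

        # Step past the oldest tweet so it is not returned again
        max_id = min(s["id"] for s in statuses) - 1

    return written


written_ids = crawl(start_id=109, max_calls=10)
print(written_ids)
```

Stepping max_id below the oldest ID avoids fetching the same tweet again, and the set acts as a safety net if a duplicate is returned anyway.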

Martin Evans