
I wrote this code to get the full list of a Twitter account's followers using Tweepy:

# ... twitter connection and streaming

fulldf = pd.DataFrame()
line = {}
ids = []
try:
    for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
        df = pd.DataFrame()
        ids.extend(page)
        try:
            for i in ids:
                user = api.get_user(i)

                line = [{'id': user.id, 
                'Name': user.name, 
                'Statuses Count':user.statuses_count,
                'Friends Count': user.friends_count,
                'Screen Name':user.screen_name,
                'Followers Count':user.followers_count,
                'Location':user.location,
                'Language':user.lang,
                'Created at':user.created_at,
                'Time zone':user.time_zone,
                'Geo enable':user.geo_enabled,
                'Description':user.description.encode(sys.stdout.encoding, errors='replace')}]
                df = pd.DataFrame(line)
                fulldf = fulldf.append(df)
                del df
                fulldf.to_csv('out.csv', sep=',', index=False)
                print i ,len(ids)
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
except tweepy.TweepError as e2:
    print "exception global block"
    print e2.message[0]['code']  
    print e2.args[0][0]['code'] 

At the end I have only 1000 lines in the CSV file. I know it's not the best solution to keep everything in memory (in a dataframe) and write it to the file inside the same loop, but at least I have something that works. The problem is that I'm not getting the full list, just 1000 out of 15000 followers.

Any help with this will be appreciated.

lazurens
  • By some chance is `"exception global block"` printing? – asongtoruin Jun 14 '17 at 08:39
  • Yes. I am not an expert, so I just wanted to know where the error occurs. But that was not the problem; in my opinion the problem is in saving the data to the file. – lazurens Jun 14 '17 at 13:45
  • I think it's to do with the way you have tried to catch the errors. I'll take a look at it this evening, if you have no answer by then. – asongtoruin Jun 14 '17 at 13:47
  • Thank you for doing that. I will keep investigating and maybe I will come up with a solution. Thanks! – lazurens Jun 14 '17 at 15:28

1 Answer


Consider the following part of your code:

for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    df = pd.DataFrame()
    ids.extend(page)
    try:
        for i in ids:
            user = api.get_user(i)

Because you call extend on every page, each new page of ids is added to the end of the same list. With the for statements nested this way, every new page makes you call get_user for all the ids from the previous pages again before you reach the new ones. As a result, by the time you reach the final page you are still working through the first 1000 or so ids when you hit the rate limit, and there are no more pages to move on to. You're also likely hitting the rate limit for your cursor, which would be why you're seeing the exception.
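To make the repeated work concrete, here is a small sketch that only counts lookups and makes no Twitter calls. The function names and the three pages of 5,000 ids are made up for illustration (followers_ids returns up to 5,000 ids per page); the first function mirrors the loop structure in the question.

```python
def count_lookups_nested(pages):
    """Mimic the question's loops: extend ids, then loop over ALL ids for every page."""
    ids = []
    lookups = 0
    for page in pages:
        ids.extend(page)
        for _ in ids:  # revisits the ids from earlier pages too
            lookups += 1
    return lookups


def count_lookups_flat(pages):
    """Collect every id first, then look each one up exactly once."""
    ids = []
    for page in pages:
        ids.extend(page)
    return len(ids)


# Three hypothetical pages of 5,000 ids each (15,000 followers in total).
pages = [list(range(p * 5000, (p + 1) * 5000)) for p in range(3)]
nested = count_lookups_nested(pages)
flat = count_lookups_flat(pages)
print(nested, flat)  # 30000 15000
```

With one get_user call per id, the nested version would need twice as many requests as there are followers, and the gap grows with every extra page.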

Let's start over a bit.

Firstly, tweepy can deal with rate limits (one of the main error sources) for you if you pass wait_on_rate_limit=True when you create your API object. This solves a whole bunch of problems, so we'll do that.
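As a rough idea of what "waiting on the rate limit" means, the sketch below retries a call after sleeping for the 15-minute window. This is not Tweepy's implementation: RateLimitError, call_with_wait and flaky_lookup are hypothetical names invented for the example, and the sleep function is injectable so the demo does not actually wait.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's rate-limit exception."""


def call_with_wait(func, *args, wait_seconds=15 * 60, max_retries=3,
                   sleep=time.sleep, **kwargs):
    """Call func; when it is rate limited, sleep for the window and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(wait_seconds)


# Demo: a fake lookup that is rate limited on its first two calls.
calls = {'count': 0}


def flaky_lookup(user_id):
    calls['count'] += 1
    if calls['count'] <= 2:
        raise RateLimitError('rate limit exceeded')
    return {'id': user_id}


waits = []  # record the requested waits instead of really sleeping
result = call_with_wait(flaky_lookup, 42, sleep=waits.append)
print(result, waits)  # {'id': 42} [900, 900]
```

Letting the library handle this removes the need for the manual time.sleep(60 * 15) inside an except block.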

Secondly, if you use lookup_users, you can look up as many as 100 user objects per request instead of one per get_user call. I've written about this in another answer, so I've taken the method from there.
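The batching itself is independent of Twitter. A minimal sketch, assuming a helper called chunked (a name chosen here, not part of Tweepy), could look like this:

```python
def chunked(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = list(range(250))  # hypothetical follower ids
batches = list(chunked(ids))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [100, 100, 50]
```

Stepping the range by the batch size means a list whose length is an exact multiple of 100 never produces an empty trailing batch.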

Finally, we don't need to create a dataframe or export to a csv until the very end. If we build a list of user information dictionaries, it can be converted to a DataFrame in a single step with no real effort from us.
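The "collect first, write once" pattern can be sketched with the standard library alone. The example below uses csv.DictWriter as a stand-in for the DataFrame step, with two made-up users and an in-memory buffer instead of a real file:

```python
import csv
import io

# Build the complete list of row dictionaries first...
rows = []
for user_id, name in [(1, 'alice'), (2, 'bob')]:  # made-up users
    rows.append({'id': user_id, 'Name': name})

# ...then write everything in one go at the very end.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id', 'Name'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Compared with rewriting the whole file after every user, the output is produced exactly once, after all the rows are known.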

Here is the full code - you'll need to sub in your keys and the username of the user you actually want to look up, but other than that it hopefully will work!

import tweepy
import pandas as pd

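def lookup_user_list(user_id_list, api):
    """Return full user objects for the given ids, 100 ids per request."""
    full_users = []
    users_count = len(user_id_list)
    try:
        # Step through the ids in batches of 100, the most lookup_users accepts per call.
        for start in range(0, users_count, 100):
            print(start)
            full_users.extend(api.lookup_users(user_ids=user_id_list[start:start + 100]))
    except tweepy.TweepError:
        # Stop early but keep whatever was fetched, so the caller still gets a list.
        print('Something went wrong, quitting...')
    return full_users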

consumer_key = 'XXX'
consumer_secret = 'XXX'
access_token = 'XXX'
access_token_secret = 'XXX'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    ids.extend(page)

results = lookup_user_list(ids, api)
all_users = [{'id': user.id,
             'Name': user.name,
             'Statuses Count': user.statuses_count,
             'Friends Count': user.friends_count,
             'Screen Name': user.screen_name,
             'Followers Count': user.followers_count,
             'Location': user.location,
             'Language': user.lang,
             'Created at': user.created_at,
             'Time zone': user.time_zone,
             'Geo enable': user.geo_enabled,
             'Description': user.description}
             for user in results]

df = pd.DataFrame(all_users)

df.to_csv('All followers.csv', index=False, encoding='utf-8')
asongtoruin
  • I edited my code and used the optimization you suggested, and I am running the script now. It seems like a good solution. Thank you for your effort @asongtoruin and for the time you invested to improve the code. I appreciate it. – lazurens Jun 14 '17 at 21:01
  • @lazurens no worries pal! You can mark it as the answer to your question with the tick to the left of the answer if you found it useful – asongtoruin Jun 14 '17 at 21:43
  • It's indeed my answer, and it fixed everything. I don't have enough reputation to click thumbs up. Thank you again. – lazurens Jun 16 '17 at 05:54