
I'm trying to download the Twitter followers of a list of accounts. My function (which uses twython) works fine for short account lists but raises an error for longer ones. It is not a rate-limit problem, since my function sleeps until the next time window whenever the rate limit is hit. The error is this:

    TwythonError: ('Connection aborted.', error(10054, ''))

Others seem to have the same problem, and the proposed solution is to make the function sleep between different REST API calls, so I implemented the following code:

    del twapi
    sleep(nap[afternoon])
    afternoon = afternoon + 1
    twapi = Twython(app_key=app_key, app_secret=app_secret,
                oauth_token=oauth_token, oauth_token_secret=oauth_token_secret)

`nap` is a list of sleep intervals in seconds and `afternoon` is an index into it. Despite this suggestion I still get exactly the same error, so the sleeps don't seem to resolve the problem. Can anyone help me?
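If it helps, this is the retry-with-backoff pattern I understand that suggestion to boil down to (just a sketch, untested; `call_with_retry` is a name I made up, and the waits are my `nap` values):

    from time import sleep
    from twython import TwythonError

    def call_with_retry(func, *args, **kwargs):
        """Retry a REST call, backing off exponentially on connection errors (sketch)."""
        last_error = None
        for wait in (1, 2, 4, 8, 16, 32, 64, 128):
            try:
                return func(*args, **kwargs)
            except TwythonError as err:  # e.g. the 10054 connection reset
                last_error = err
                sleep(wait)  # back off before retrying
        raise last_error  # all retries failed

    # usage (hypothetical):
    # page = call_with_retry(twapi.get_followers_ids, screen_name=account_name, cursor=next_cursor)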

Here is the whole function:

import random
from time import sleep, localtime

import twython
from twython import Twython

def download_follower(serie_lst):
    """Create one <account_name>.txt file of follower IDs for each account in serie_lst."""
    nap = [1, 2, 4, 8, 16, 32, 64, 128]
    afternoon = 0

    for exemplar in serie_lst:

        #username from serie_lst entries
        account_name = exemplar

        twapi = Twython(app_key=app_key, app_secret=app_secret,
                        oauth_token=oauth_token, oauth_token_secret=oauth_token_secret)

        try:
            #initializations
            del twapi
            if afternoon >= 7:
                afternoon = 0

            sleep(nap[afternoon])
            afternoon = afternoon + 1
            twapi = Twython(app_key=app_key, app_secret=app_secret,
                        oauth_token=oauth_token, oauth_token_secret=oauth_token_secret)
            next_cursor = -1
            result = {}
            result["screen_name"] = ""
            result["followers"] = []
            iteration = 0
            file_name = ""

            #user info
            user = twapi.lookup_user(screen_name = account_name)

            #store user name
            result['screen_name'] = account_name

            #loop until all cursored results are stored
            while (next_cursor != 0):
                sleep(random.randrange(start = 1, stop = 15, step = 1))
                call_result = twapi.get_followers_ids(screen_name = account_name, cursor = next_cursor)
                #append each follower id to result["followers"]
                for i in call_result["ids"]:
                    result["followers"].append(i)
                next_cursor = call_result["next_cursor"] #new next_cursor
                iteration = iteration + 1
                if (iteration > 13): #after 14 pages, sleep out the 15-min rate-limit window
                    error_msg = localtime()
                    error_msg = "".join([str(error_msg.tm_mon), "/", str(error_msg.tm_mday), "/", str(error_msg.tm_year), " at ", str(error_msg.tm_hour), ":", str(error_msg.tm_min)])
                    error_msg ="".join(["Twitter API Request Rate Limit hit on ", error_msg, ", wait..."])
                    print(error_msg)
                    del error_msg
                    sleep(901) #15min + 1sec
                    iteration = 0

            #output file
            file_name = "".join([account_name, ".txt"])

            #print output
            out_file = open(file_name, "w") #open file "account_name.txt"
            #out_file.write(str(result["followers"])) #standard format
            for i in result["followers"]: #R friendly table format
                out_file.write(str(i))
                out_file.write("\n")
            out_file.close()

        except twython.TwythonRateLimitError:
            #wait
            error_msg = localtime()
            error_msg = "".join([str(error_msg.tm_mon), "/", str(error_msg.tm_mday), "/", str(error_msg.tm_year), " at ", str(error_msg.tm_hour), ":", str(error_msg.tm_min)])
            error_msg ="".join(["Twitter API Request Rate Limit hit on ", error_msg, ", wait..."])
            print(error_msg)
            del error_msg
            del twapi
            sleep(901) #15min + 1sec

            #initializations
            if afternoon >= 7:
                afternoon = 0

            sleep(nap[afternoon])
            afternoon = afternoon + 1
            twapi = Twython(app_key=app_key, app_secret=app_secret,
                        oauth_token=oauth_token, oauth_token_secret=oauth_token_secret)
            next_cursor = -1
            result = {}
            result["screen_name"] = ""
            result["followers"] = []
            iteration = 0
            file_name = ""

            #user info
            user = twapi.lookup_user(screen_name = account_name)

            #store user name
            result['screen_name'] = account_name

            #loop until all cursored results are stored
            while (next_cursor != 0):
                sleep(random.randrange(start = 1, stop = 15, step = 1))
                call_result = twapi.get_followers_ids(screen_name = account_name, cursor = next_cursor)
                #append each follower id to result["followers"]
                for i in call_result["ids"]:
                    result["followers"].append(i)
                next_cursor = call_result["next_cursor"] #new next_cursor
                iteration = iteration + 1
                if (iteration > 13): #after 14 pages, sleep out the 15-min rate-limit window
                    error_msg = localtime()
                    error_msg = "".join([str(error_msg.tm_mon), "/", str(error_msg.tm_mday), "/", str(error_msg.tm_year), " at ", str(error_msg.tm_hour), ":", str(error_msg.tm_min)])
                    error_msg = "".join(["Twitter API Request Rate Limit hit on ", error_msg, ", wait..."])
                    print(error_msg)
                    del error_msg
                    sleep(901) #15min + 1sec
                    iteration = 0

            #output file
            file_name = "".join([account_name, ".txt"])

            #print output
            out_file = open(file_name, "w") #open file "account_name.txt"
            #out_file.write(str(result["followers"])) #standard format
            for i in result["followers"]: #R friendly table format
                out_file.write(str(i))
                out_file.write("\n")
            out_file.close()
mbiella
  • What are the values in `nap`? What is the initial value of `afternoon`? You need to provide some more context for this to be understandable. – asongtoruin Feb 20 '17 at 18:19
  • `nap = [1,2,4,8,16,32,64,128]` and `afternoon` is initialized at 0 and set back to 0 when needed. That part is checked; the problem is that despite the program sleeping between each call, the server keeps closing the connection – mbiella Feb 21 '17 at 09:45
  • Why are you using such short rests? If it is a rate limit issue then these values probably wouldn't be long enough to get into the next window if, [as it seems](https://dev.twitter.com/rest/public/rate-limits), limits are per 15 minute period. – asongtoruin Feb 21 '17 at 09:56
  • Also, why are you deleting your connection every few seconds? You should be able to leave the connection open but wait to make your next request, I think. – asongtoruin Feb 21 '17 at 10:00
  • It is not a rate-limit problem; to avoid that, my function sleeps for 900 sec (15 min). I had rate-limit problems but I already resolved them. This time it is a different kind of issue: probably the Twitter server considers my calls a denial-of-service attack, so I make my function sleep for different time intervals, and I delete the connection for the same reason (as suggested here: http://stackoverflow.com/questions/27333671/how-to-solve-the-10054-error) – mbiella Feb 21 '17 at 14:28
  • Can you provide more of your code? It's difficult to work out what's going on from this small section. – asongtoruin Feb 21 '17 at 14:36
  • @asongtoruin I added the whole function. I know it looks bad, but I'm pretty new to the Python world. Thank you for your help! – mbiella Feb 21 '17 at 16:43
  • Did you accidentally replicate your code in copying (`#initialisations` onwards) or is this actually part of what you've written? – asongtoruin Feb 21 '17 at 17:13
  • Also, are you certain about using `Twython`? I think `Tweepy` deals with cursors better, and I might be able to help you better with it. – asongtoruin Feb 21 '17 at 17:24
  • I didn't accidentally replicate the initializations, they are actually in the code. I know, it is not elegant at all! I chose Twython just because I have used it since the beginning; I don't know the differences between Twython and Tweepy. Anyway, I don't get why I keep having that error even though my function sleeps! I'll cry all day long! – mbiella Feb 22 '17 at 09:28
  • I think the replication of your initialisation might be why it errors out even if it sleeps - once it hits the `RateLimitError` for the first time, there's no catch on it. I've worked up a solution in `Tweepy` that I'm just testing now - I'll let you know if it works if this will help? – asongtoruin Feb 22 '17 at 09:43
  • Thanks a lot!! Your solution will surely be helpful! Let me know. – mbiella Feb 22 '17 at 09:56
  • Are you after follower usernames or just IDs? – asongtoruin Feb 22 '17 at 10:03
  • For now just IDs. If needed I'll manage usernames myself later on... – mbiella Feb 22 '17 at 10:56

1 Answer


As discussed in the comments, there are a few issues with your code at present. You shouldn't need to delete your connection for it to function properly, and I think the error arises because you initialise for a second time without any catch for hitting the rate limit. Here is an example using Tweepy of how you can get the information you require:

import tweepy
from datetime import datetime


def download_followers(user, api):
    """Collect every follower ID for one username, paging through tweepy.Cursor."""
    all_followers = []
    try:
        for page in tweepy.Cursor(api.followers_ids, screen_name=user).pages():
            all_followers.extend(map(str, page))
        return all_followers
    except tweepy.TweepError:
        print('Could not access user {}. Skipping...'.format(user))

# Include your keys below:
consumer_key = 'YOUR_KEY'
consumer_secret = 'YOUR_KEY'
access_token = 'YOUR_KEY'
access_token_secret = 'YOUR_KEY'

# Set up tweepy API, with handling of rate limits
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
main_api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# List of usernames to get followers for
lookup_users = ['asongtoruin', 'mbiella']

for username in lookup_users:
    user_followers = download_followers(username, main_api)
    if user_followers:
        with open(username + '.txt', 'w') as outfile:
            outfile.write('\n'.join(user_followers))
        print('Finished outputting: {} at {}'.format(username, datetime.now().strftime('%Y/%m/%d %H:%M:%S')))

Tweepy is clever enough to know when it has hit its rate limit when we use `wait_on_rate_limit=True`, and it checks how long it needs to sleep before it can start again. By using `wait_on_rate_limit_notify=True`, we let it print out how long it will be waiting before it can fetch the next page of followers (through this ID-based method, there appear to be 5000 IDs per page).
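For scale: assuming the documented limit of 15 requests per 15-minute window for the followers/ids endpoint, that is at most 15 × 5000 = 75,000 follower IDs per window before tweepy has to sleep.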

We additionally catch a `TweepError` exception - this can occur if the username provided belongs to a protected account that our authenticated user does not have permission to view. In this case we simply skip the user so that other accounts can still be downloaded, but print a warning that the user could not be accessed.
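If you would rather detect protected accounts up front instead of relying on the exception, a check along these lines should work (a sketch, assuming the same `main_api` object as above; `is_accessible` is a name I've made up, and it costs one extra users/show request per account):

    def is_accessible(user, api):
        """Return True if the account exists and its follower list is visible (sketch)."""
        try:
            return not api.get_user(screen_name=user).protected
        except tweepy.TweepError:  # e.g. suspended or non-existent account
            return False

    # usage: lookup_users = [u for u in lookup_users if is_accessible(u, main_api)]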

Running this saves a text file of follower IDs for any user it can access. For me this prints the following:

Rate limit reached. Sleeping for: 593
Finished outputting: asongtoruin at 2017/02/22 11:43:12
Could not access user mbiella. Skipping...

With the follower IDs of asongtoruin (aka me) saved as `asongtoruin.txt`.

There is one possible issue, in that our pages of followers start from the newest first. This could (though I don't understand the API well enough to say with certainty) cause problems with the output dataset if new followers are added between our calls, as we might both miss those users and end up with duplicates. If duplicates become an issue, you could change `return all_followers` to `return set(all_followers)`.
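If you want to drop duplicates while preserving the newest-first ordering that `set` would destroy, a small seen-set loop does it (a sketch; `dedupe_keep_order` is a made-up helper):

    def dedupe_keep_order(ids):
        """Remove duplicate IDs, keeping each ID's first occurrence in order (sketch)."""
        seen = set()
        unique = []
        for follower_id in ids:
            if follower_id not in seen:
                seen.add(follower_id)
                unique.append(follower_id)
        return unique

    # usage: return dedupe_keep_order(all_followers) instead of return set(all_followers)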

asongtoruin