
I have a list of more than 500K Twitter usernames. I developed a program that uses Twython with my API secret keys. The program and its inputs are too large to post here, so I have uploaded them to GitHub:

Twitter_User_Geolocation

The program runs fine for around 150 usernames, but not more than that. This limitation makes it impossible to fetch geolocations for the 500K+ usernames.

I am looking for help in bypassing the API, perhaps by using a web-scraping technique or any better alternative to get the geolocations of these usernames.

Any help is appreciated :)

Sitz Blogz
  • Are you working with the REST API, or the Streaming API? I'm not positive about the limitations of the REST API, but you can simply request the geo-location through the streaming API. – Tristen Apr 18 '17 at 20:14
  • I'm using Twython. Could you have a look at the code and help me with a solution, please? – Sitz Blogz Apr 19 '17 at 02:18
  • In your code you're also calling the Google Maps API; make sure you respect the usage limits set by that API: https://developers.google.com/maps/documentation/geocoding/usage-limits – Merouane Benthameur Apr 19 '17 at 06:26
  • @MerouaneBenthameur That is exactly my concern: I want to eliminate the APIs and use a better web-scraping technique or some other alternative. – Sitz Blogz Apr 19 '17 at 06:31

1 Answer


What I would do is scrape twitter.com/ instead of using the Twitter API.

The main reason is that the frontend is not rate limited (or at least far less limited than the API), and even if you need to hit Twitter many times per second, you can play with the User-Agent header and proxies so you don't get spotted.

So for me, scraping is the easiest way to bypass the API limits.
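
For the User-Agent / proxy part, here is a minimal sketch of what I mean, using urllib2 like the script below; the USER_AGENTS and PROXIES lists are placeholders you would fill with your own rotation pool:

import random
import urllib2

# Placeholder rotation pools -- fill these with your own values.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12)',
]
PROXIES = ['http://127.0.0.1:8080']

def fetch(url):
    # Use a random proxy and User-Agent for each request so the
    # traffic does not all look identical to the frontend.
    opener = urllib2.build_opener(
        urllib2.ProxyHandler({'http': random.choice(PROXIES)}))
    request = urllib2.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
    return opener.open(request).read()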

Moreover, what you need to crawl is really easy to access. I wrote a simple-and-dirty script that parses your CSV file and outputs the location of each user.

I will make a PR on your repo for fun, but here is the code:

#!/usr/bin/env python

import urllib2
from bs4 import BeautifulSoup

with open('00_Trump_05_May_2016.csv', 'r') as csv:
    next(csv)  # skip the header line
    for line in csv:
        line = line.strip()

        # The permalink (last column) holds a well-formed slug for the profile.
        permalink = line.split(',')[-1].strip()
        username  = line.split(',')[0]
        userid    = permalink.split('/')[3]

        page_url = 'http://twitter.com/{0}'.format(userid)

        try:
            page = urllib2.urlopen(page_url)
        except urllib2.HTTPError:
            print 'ERROR: username {} not found'.format(username)
            continue  # skip this user instead of reading a page that never loaded

        content = page.read()
        html = BeautifulSoup(content, 'html.parser')
        location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()

        print 'username {0} ({1}) located in {2}'.format(username, userid, location)

Output:

username cenkuygur (cenkuygur) located in Los Angeles
username ilovetrumptards (ilovetrumptards) located in 
username MorganCarlston hanifzk (MorganCarlston) located in 
username mitchellvii (mitchellvii) located in Charlotte, NC
username MissConception0 (MissConception0) located in #UniteBlue in Semi-Red State
username HalloweenBlogs (HalloweenBlogs) located in Los Angeles, California
username bengreenman (bengreenman) located in Fiction and Non-Fiction Both
...

Obviously you should update this code to make it more robust, but the basics are done; a couple of quick hardening steps are sketched below.
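
For instance, here is a small helper (a sketch, not part of the original script): it returns an empty string when a profile has no location set instead of raising an IndexError, and you can pause briefly between requests; the one-second delay is an arbitrary choice, not anything Twitter documents:

import time
from bs4 import BeautifulSoup

def extract_location(content):
    # Return the profile location, or an empty string when none is set,
    # instead of crashing on profiles without a location.
    html = BeautifulSoup(content, 'html.parser')
    nodes = html.select('.ProfileHeaderCard-locationText')
    return nodes[0].text.strip() if nodes else ''

# In the loop above you would replace the two parsing lines with:
#     location = extract_location(content)
#     time.sleep(1)  # small pause so the frontend is not hammered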

PS: I parse the 'permalink' field because it stores a well-formatted slug that leads to the profile page. It's pretty dirty, but it's quick and it works.
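
If you want something a bit cleaner, you could let the csv and urlparse modules do the splitting; this assumes the permalink column is a full twitter.com/<user>/status/<id> URL, which is what the split('/')[3] in the script above already relies on:

import csv
from urlparse import urlparse

def screen_name_from_permalink(permalink):
    # 'https://twitter.com/cenkuygur/status/123...' -> 'cenkuygur'
    return urlparse(permalink).path.strip('/').split('/')[0]

with open('00_Trump_05_May_2016.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        username, permalink = row[0], row[-1].strip()
        print screen_name_from_permalink(permalink)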


About the Google API, I would definitely use some kind of cache / database to avoid making too many Google calls.

In Python, without a database, you can just build a dict like:

{
   "San Fransisco": [x.y, z.a],
   "Paris": [b.c, d.e],
}

For each location to geocode, I would first check whether the key exists in this dict: if it does, just take the value from there; otherwise call the Google API and then save the result in the dict.
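
As a rough sketch of that cache-first pattern, assuming the Google Geocoding web service (https://maps.googleapis.com/maps/api/geocode/json) with an API key; GOOGLE_API_KEY and geocode_cached are my own placeholder names, not something from the repo:

import json
import urllib
import urllib2

GOOGLE_API_KEY = 'YOUR_KEY_HERE'  # placeholder
geocode_cache = {}  # e.g. {"Paris": (48.85, 2.35)}

def geocode_cached(location):
    # Cache hit: no Google call at all.
    if location in geocode_cache:
        return geocode_cache[location]

    # Cache miss: call the Geocoding API once, then remember the result.
    url = 'https://maps.googleapis.com/maps/api/geocode/json?' + urllib.urlencode(
        {'address': location, 'key': GOOGLE_API_KEY})
    data = json.loads(urllib2.urlopen(url).read())
    if data.get('results'):
        coords = data['results'][0]['geometry']['location']
        geocode_cache[location] = (coords['lat'], coords['lng'])
    else:
        geocode_cache[location] = None  # remember failures too, to avoid re-calling
    return geocode_cache[location]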


I think with these two approaches you will be able to get your data.

Arount
  • Thank you so much. Can I request one thing, please: could you put the whole input dataframe into the output along with these new columns? That way the code will be ready to use. :-) – Sitz Blogz Apr 19 '17 at 09:13
  • Tell me more in the GitHub PR, please. I have never used dataframes, so I will see. But I think the Stack Overflow part is done; future users have their answer. – Arount Apr 19 '17 at 09:20