In the context of a project, I have hydrated 1.6 million tweets, i.e. retrieved the metadata associated with each tweet, such as its creation date and location.
My dataset contains tweets from all over the world, but I am only interested in tweets created in the US. I also want to compute some statistics by state. Since most of the locations associated with the tweets are wrong or not formalised, I need to formalise them first.
Here are the kind of locations that I have: ['한국어 강제 수용소 (DPRK)', 'Lagos, Nigeria', 'Kolkata, India', 'Who cares', 'Unknown', 'British Columbia, Canada', 'Bitcoin & Markets', 'White Plains, NY', 'Washington, DC']
I was able to write code that filters these locations and formalises them, but it processes only about 2 locations per second (2.03 it/s), which means formalising all 1.6 million locations would take between 8 and 9 days. I am looking to speed up this process.
In the beginning, my df looked like this:
Here is the code I used; I only tried it on a sample since the process is slow:
from geopy import geocoders
from tqdm.auto import tqdm

# Note: the Nominatim usage policy caps requests at about 1 per second.
geolocator = geocoders.Nominatim(user_agent='myapplication')
tqdm.pandas()

def get_address(x):
    try:
        return geolocator.geocode(x).address
    except Exception:  # geocode() returns None for unresolvable strings
        return ""

df_s = df.sample(1000)
df_s["new_loc"] = df_s.user_location.progress_apply(get_address)
# The country is the last comma-separated component of the resolved address.
df_s["country"] = df_s.new_loc.apply(lambda x: x.split(",")[-1])
df_s = df_s[df_s.country.apply(lambda x: "United States" in x)]
# Drop addresses that resolved to a single component (country only).
df_s = df_s[df_s.new_loc.apply(lambda x: len(x.split(",")) > 1)]
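One thing I noticed is that many tweets share the same location string, so geocoding each row repeats a lot of identical lookups. A sketch of geocoding only the unique values and mapping the results back (here `fake_geocode` is a hypothetical stand-in for `geolocator.geocode(x).address`, just to illustrate the pattern without network calls):

```python
import pandas as pd

def fake_geocode(loc):
    # Hypothetical stand-in for geolocator.geocode(loc).address.
    lookup = {"White Plains, NY": "White Plains, Westchester County, New York, United States"}
    return lookup.get(loc, "")

df = pd.DataFrame({"user_location": [
    "White Plains, NY", "Lagos, Nigeria", "White Plains, NY", "White Plains, NY",
]})

# Geocode each unique string once: 2 lookups here instead of 4 row-wise calls.
unique_locs = df["user_location"].dropna().unique()
resolved = {loc: fake_geocode(loc) for loc in unique_locs}

# Map the cached results back onto every row.
df["new_loc"] = df["user_location"].map(resolved)
```

With 1.6 million tweets the number of distinct location strings is presumably far smaller than the number of rows, so this alone could cut the runtime substantially — but I am not sure it is the best approach.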
In the end, my df looked like this, which is what I wanted:
Is there a way to do this faster?