In the context of a project, I have hydrated 1.6 million tweets, i.e. retrieved the metadata associated with each tweet, such as its creation date and location.
My dataset contains tweets from all over the world, but I am only interested in tweets created in the US. I also want to compute some statistics by state. Since most of the locations associated with the tweets are wrong or not formalised, I need to formalise them first.
Here are the kind of locations that I have: ['한국어 강제 수용소 (DPRK)', 'Lagos, Nigeria', 'Kolkata, India', 'Who cares', 'Unknown', 'British Columbia, Canada', 'Bitcoin & Markets', 'White Plains, NY', 'Washington, DC']
I was able to write code that filters these locations and formalises them, but it processes only about 2 locations per second (2.03 it/s), which means formalising all 1.6 million locations would take between 8 and 9 days. I am looking to speed up this process.
In the beginning, my df looked like this:
Here is the code I used; I only tried it on a sample since the process is slow:
from geopy import geocoders
from tqdm.auto import tqdm

# Note: the Nominatim usage policy caps requests at about 1 per second.
geolocator = geocoders.Nominatim(user_agent='myapplication')
tqdm.pandas()

def get_address(x):
    try:
        return geolocator.geocode(x).address
    except Exception:  # geocode() returns None for unresolvable strings
        return ""

df_s = df.sample(1000)
df_s["new_loc"] = df_s.user_location.progress_apply(get_address)
# The country is the last comma-separated component of the resolved address.
df_s["country"] = df_s.new_loc.apply(lambda x: x.split(",")[-1])
df_s = df_s[df_s.country.apply(lambda x: "United States" in x)]
# Drop addresses that resolved to a single component (country only).
df_s = df_s[df_s.new_loc.apply(lambda x: len(x.split(",")) > 1)]
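One thing I noticed is that many tweets share the same location string, so geocoding each row repeats a lot of identical lookups. A sketch of geocoding only the unique values and mapping the results back (here `fake_geocode` is a hypothetical stand-in for `geolocator.geocode(x).address`, just to illustrate the pattern without network calls):

```python
import pandas as pd

def fake_geocode(loc):
    # Hypothetical stand-in for geolocator.geocode(loc).address.
    lookup = {"White Plains, NY": "White Plains, Westchester County, New York, United States"}
    return lookup.get(loc, "")

df = pd.DataFrame({"user_location": [
    "White Plains, NY", "Lagos, Nigeria", "White Plains, NY", "White Plains, NY",
]})

# Geocode each unique string once: 2 lookups here instead of 4 row-wise calls.
unique_locs = df["user_location"].dropna().unique()
resolved = {loc: fake_geocode(loc) for loc in unique_locs}

# Map the cached results back onto every row.
df["new_loc"] = df["user_location"].map(resolved)
```

With 1.6 million tweets the number of distinct location strings is presumably far smaller than the number of rows, so this alone could cut the runtime substantially — but I am not sure it is the best approach.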
In the end, my df looked like this, which is what I wanted:
Is there a way to do this faster?