I have a list of approximately 400,000 IPs (stored in a pandas DataFrame df_IP) to geolocate using the MaxMind GeoIP database. I use the City version, and I retrieve the city, latitude, longitude and department code (département in France), because some cities have the same name but are in very different places.
Here is my working code:
import geoip2.database
import pandas as pd
reader = geoip2.database.Reader('path/to/GeoLite2-City.mmdb')
results = pd.DataFrame(columns=('IP',
                                'city',
                                'latitude',
                                'longitude',
                                'dept_code'))

for i, IP in enumerate(df_IP["IP"]):
    try:
        response = reader.city(IP)
        # append one row per IP, by label
        results.loc[i] = [IP,
                          response.city.name,
                          response.location.latitude,
                          response.location.longitude,
                          response.subdivisions.most_specific.iso_code]
    except Exception as e:
        print("error with line {}, IP {}: {}".format(i, df_IP["IP"][i], e))
It works well, but it gets slower and slower with each iteration. If I time it on the first 1,000 IPs, it takes 4.7 s, so the whole 400,000 should take roughly 30 minutes, yet it has been running for almost 4 hours.
The only thing that, IMO, can slow down over time is the filling of the results DataFrame. What alternatives do I have that do not use .loc and could be faster? I still need the same DataFrame at the end.
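To be clear about the kind of thing I am after: I was considering accumulating the rows in a plain Python list and building the DataFrame once at the end, something like the untested sketch below, but I do not know whether this is the idiomatic or fastest way.

rows = []  # plain list; appending to it is cheap
for i, IP in enumerate(df_IP["IP"]):
    try:
        response = reader.city(IP)
        rows.append({'IP': IP,
                     'city': response.city.name,
                     'latitude': response.location.latitude,
                     'longitude': response.location.longitude,
                     'dept_code': response.subdivisions.most_specific.iso_code})
    except Exception as e:
        print("error with line {}, IP {}: {}".format(i, IP, e))

# build the DataFrame once, from the accumulated rows
results = pd.DataFrame(rows, columns=('IP', 'city', 'latitude',
                                      'longitude', 'dept_code'))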
I would also be interested in an explanation as to why .loc is so slow on large DataFrames.
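For what it is worth, the slowdown is reproducible with .loc alone, with no GeoIP involved; the sizes below are arbitrary:

import time
import pandas as pd

df = pd.DataFrame(columns=('a', 'b'))
start = time.time()
for i in range(20000):
    df.loc[i] = [i, 2 * i]  # append one row by label
    if (i + 1) % 5000 == 0:
        # the elapsed time for each 5,000-row block keeps growing
        print("{:>6} rows: {:.1f}s total".format(i + 1, time.time() - start))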