
I have a list of approximately 400k IP addresses (stored in a pandas DataFrame, `df_IP`) to geolocate using the MaxMind GeoIP2 database. I use the City version, and I retrieve the city, latitude, longitude, and department code (the département in France), because some cities have the same name but are in very different places.

Here is my working code:

import geoip2.database
import pandas as pd

reader = geoip2.database.Reader('path/to/GeoLite2-City.mmdb')
results = pd.DataFrame(columns=('IP',
                                'city',
                                'latitude',
                                'longitude',
                                'dept_code'))

for i, IP in enumerate(df_IP["IP"]):
    try:
        response = reader.city(IP)
        results.loc[i] = [IP,
                          response.city.name,
                          response.location.latitude,
                          response.location.longitude,
                          response.subdivisions.most_specific.iso_code]
    except Exception as e:
        print("error with line {}, IP {}: {}".format(i, df_IP["IP"][i], e))

It works well, but it gets slower and slower with each loop iteration. If I time it on the first 1,000 IPs, it takes 4.7 s, so the whole 400k should take approximately 30 minutes, yet it runs for almost 4 hours.

The only thing, in my opinion, that can slow down over time is the filling of the DataFrame `results`: what alternatives do I have that do not use `.loc` and could be faster? I still need the same DataFrame in the end.

I would also be interested in an explanation as to why `.loc` is so slow on large DataFrames.

CoMartel
  • Have you thought about using one of pandas' iterators (e.g. `iterrows()`) to iterate over your rows, and using `apply` with your reader function to create one new column with a string that has all your geodata? You could then split the strings to create individual columns for all your geodata (see the sketch after this comment thread). Not sure if that would be faster, but when iterating over a dataframe it's usually best practice to use something like `iterrows()`. – Khris Feb 16 '17 at 12:56
  • I have had a similar problem in the past, with `loc` being very slow inside a for loop. I found I could circumvent the problem by generating the data for the new column as a separate list and then assigning it in that form. This required more lines of code and was a bit uglier, but had much better performance than `loc`. Might be worth considering if you can apply this. – oliversm Feb 16 '17 at 12:56
  • @oliversm can you elaborate ? I don't really understand your trick. – CoMartel Feb 16 '17 at 13:04
  • @Khris, no I've never used `iterrows()`, I will try – CoMartel Feb 16 '17 at 13:04
  • Are the IPs unique? – Khris Feb 16 '17 at 13:13
  • @Khris Yes, I have dropped the duplicates – CoMartel Feb 16 '17 at 13:14
  • Then there is no difference in the number of calls to the geoip database and trying out `iterrows()` seems like a good idea. – Khris Feb 16 '17 at 13:17
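
For illustration, here is a minimal sketch of the `apply`-based idea from the comments, assuming the `reader` and `df_IP` from the question; the helper name `geolocate` is hypothetical. Rather than packing the geodata into one string and splitting it later, the helper can return a `pd.Series` directly, which `apply` expands into columns:

def geolocate(ip):
    # hypothetical helper: one row of geodata per IP, None on lookup failure
    try:
        r = reader.city(ip)
        return pd.Series([ip, r.city.name, r.location.latitude,
                          r.location.longitude,
                          r.subdivisions.most_specific.iso_code])
    except Exception:
        return pd.Series([ip, None, None, None, None])

results = df_IP["IP"].apply(geolocate)
results.columns = ['IP', 'city', 'latitude', 'longitude', 'dept_code']

Note that the database lookups still dominate the runtime; `apply` mainly avoids the row-by-row DataFrame enlargement that makes the original loop degrade.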

2 Answers


I was faced with a similar situation, as `loc` was causing the runtime to blow up for me. After a lot of fiddling, I found a simple solution that is super fast: use `set_value` instead of `loc`.

This is what sample code looks like; you can tweak it for your use case. Say your DataFrame `df` looks like this:

Index  'A'  'B' 'Label'
23      0    1    Y
45      3    2    N

df.set_value(45, 'Label', 'NA')

This sets the value of the "Label" column to "NA" for the second row (index 45).

More on `set_value` can be found in the documentation linked below:

http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.set_value.html
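
For illustration, here is a minimal sketch of this approach applied to the question's loop (an assumption on my part, not code from the answer): preallocating all the rows up front means every write hits an existing cell, which is the case where scalar writes are fast. Note that `set_value` was deprecated in pandas 0.21 and removed in 1.0; `df.at[row, col] = value` is the modern equivalent.

# preallocate all rows so each write targets an existing cell
results = pd.DataFrame(index=range(len(df_IP)),
                       columns=('IP', 'city', 'latitude',
                                'longitude', 'dept_code'))

for i, IP in enumerate(df_IP["IP"]):
    try:
        response = reader.city(IP)
        row = (IP, response.city.name,
               response.location.latitude,
               response.location.longitude,
               response.subdivisions.most_specific.iso_code)
        for col, value in zip(results.columns, row):
            results.at[i, col] = value  # results.set_value(i, col, value) on old pandas
    except Exception as e:
        print("error with line {}, IP {}: {}".format(i, df_IP["IP"][i], e))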

Megha

I had the same problem, and as @oliversm suggests, I created a list and then added it to the original dataset. Here is what the code would look like:

....

results_list = []

for i, IP in enumerate(df_IP["IP"]):
    try:
        response = reader.city(IP)
        results_list.append((IP,
                             response.city.name,
                             response.location.latitude,
                             response.location.longitude,
                             response.subdivisions.most_specific.iso_code))
    except Exception as e:
        print("error with line {}, IP {}: {}".format(i, df_IP["IP"][i], e))

# build the final DataFrame from the list in one shot
# instead of growing it row by row
results = pd.DataFrame(results_list,
                       columns=('IP', 'city', 'latitude',
                                'longitude', 'dept_code'))
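
This also answers why `.loc` gets slower over time: every `results.loc[i] = [...]` on an index label that does not yet exist enlarges the DataFrame, which reallocates and copies the underlying data, so the loop does roughly quadratic work overall. Appending to a Python list is amortized O(1), and the DataFrame is then built once at the end.
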
Dadep