
I have a function find_country_from_connection_ip which takes an IP and, after some processing, returns a country, like below:

def find_country_from_connection_ip(ip):
    # Do some processing
    return country

I am using the function inside the apply method, like below:

df['Country'] = df.apply(lambda x: find_country_from_ip(x['IP']), axis=1)

What I want is pretty straightforward: evaluate a new column from an existing column in a DataFrame which has >400,000 rows.

It runs, but terribly slowly, and it raises a warning like below:

...........: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

I understand the problem, but can't quite figure out how to use loc with apply and lambda.

N.B. If you have a more efficient alternative solution that achieves the same end result, please suggest it.

**** EDIT ********

The function is mainly a lookup on an mmdb database, like below:

import re
import subprocess

def find_country_from_ip(ip):
    # NOTE: on Python 3, stdout.read() returns bytes; decode it before re.search
    result = subprocess.Popen(
        "mmdblookup --file GeoIP2-Country.mmdb --ip {} country names en".format(ip).split(" "),
        stdout=subprocess.PIPE).stdout.read()
    if result:
        return re.search(r'\"(.+?)\"', result).group(1)
    else:
        final_output = subprocess.Popen(
            "mmdblookup --file GeoIP2-Country.mmdb --ip {} registered_country names en".format(ip).split(" "),
            stdout=subprocess.PIPE).stdout.read()
        return re.search(r'\"(.+?)\"', final_output).group(1)

This is nevertheless a costly operation, and when you have a DataFrame with >400,000 rows, it is bound to take time. But how much? That is the question. It takes about 2 hours, which I think is far too long.
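One obvious mitigation, assuming many rows share the same IP: resolve each distinct IP only once and map the results back onto the column. A minimal sketch, reusing the function above:

unique_countries = {ip: find_country_from_ip(ip) for ip in df['IP'].unique()}
# Duplicate IPs reuse the cached result instead of spawning another
# mmdblookup subprocess.
df['Country'] = df['IP'].map(unique_countries)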

Ahsanul Haque
  • I think a more efficient solution would be to omit `apply` and rewrite the custom function as some vectorized pandas function, if possible. – jezrael Oct 24 '16 at 05:54
  • Can you add the whole `find_country_from_ip` function to the question? – jezrael Oct 24 '16 at 05:56
  • @jezrael, edited. You may take a look now. – Ahsanul Haque Oct 24 '16 at 06:04
  • Hmmm, it is hard; maybe I can only give some suggestions. Does `subprocess.Popen("mmdblookup --file GeoIP2-Country.mmdb --ip {} country names en".format(ip).split(" "), stdout=subprocess.PIPE).stdout.read()` have to be called for each row, or only for each unique `IP`? Are there duplicate `IP`s or not? – jezrael Oct 24 '16 at 06:08
  • What is `print (len(df.IP.drop_duplicates()))`? – jezrael Oct 24 '16 at 06:16

5 Answers


I would use the maxminddb-geolite2 (GeoLite) module for that.

First, install the maxminddb-geolite2 module:

pip install maxminddb-geolite2

Python Code:

import pandas as pd
import numpy as np   # pd.np was removed in pandas 1.0, so import numpy directly
from geolite2 import geolite2

def get_country(ip):
    try:
        x = geo.get(ip)
    except ValueError:
        return np.nan
    try:
        return x['country']['names']['en'] if x else np.nan
    except KeyError:
        return np.nan

geo = geolite2.reader()

# it took me quite some time to find a free and large enough list of IPs ;)
# IPs for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep=r'\s*\-\s*',
                header=None, names=['ip'])

# get unique IPs
unique_ips = x['ip'].unique()
# make a Series out of it, indexed by the IP itself
unique_ips = pd.Series(unique_ips, index=unique_ips)
# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))

geolite2.close()

Output:

In [90]: x
Out[90]:
                     ip     country
0       000.000.000.000         NaN
1       001.002.004.000         NaN
2       001.002.008.000         NaN
3       001.009.096.105         NaN
4       001.009.102.251         NaN
5       001.009.106.186         NaN
6       001.016.000.000         NaN
7       001.055.241.140         NaN
8       001.093.021.147         NaN
9       001.179.136.040         NaN
10      001.179.138.224    Thailand
11      001.179.140.200    Thailand
12      001.179.146.052         NaN
13      001.179.147.002    Thailand
14      001.179.153.216    Thailand
15      001.179.164.124    Thailand
16      001.179.167.188    Thailand
17      001.186.188.000         NaN
18      001.202.096.052         NaN
19      001.204.179.141       China
20      002.051.000.165         NaN
21      002.056.000.000         NaN
22      002.095.041.202         NaN
23      002.135.237.106  Kazakhstan
24      002.135.237.250  Kazakhstan
...                 ...         ...

Timing: for 171,884 unique IPs:

In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop

In [86]: unique_ips.shape
Out[86]: (171884,)

Conclusion: it would take approx. 35 seconds for your DF with 400K unique IPs on my hardware:

In [93]: 400000/171884*15
Out[93]: 34.90726303786274
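The unique() + map combination above is the key design choice: each distinct IP is looked up exactly once, and the results are then broadcast back onto the full column. An equivalent sketch using a plain dict instead of an intermediate Series (same names as the snippet above):

# Build a one-off mapping of unique IP -> country, then map it over the column;
# duplicate IPs hit the dict rather than the geolite2 reader.
ip_to_country = {ip: get_country(ip) for ip in x['ip'].unique()}
x['country'] = x['ip'].map(ip_to_country)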
MaxU - stand with Ukraine

IIUC you can use your custom function with Series.apply this way:

df['Country'] = df['IP'].apply(find_country_from_ip)

Sample:

df = pd.DataFrame({'IP':[1,2,3],
                   'B':[4,5,6]})

def find_country_from_ip(ip):
    # Do some processing
    # some testing formula
    country = ip + 5
    return country

df['Country'] = df['IP'].apply(find_country_from_ip)

print (df)
   B  IP  Country
0  4   1        6
1  5   2        7
2  6   3        8
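Note on the design choice: `df['IP'].apply(find_country_from_ip)` passes each scalar value to the function directly, whereas `df.apply(lambda x: ..., axis=1)` constructs a full row Series for every record, so the column-wise form avoids that per-row overhead in addition to being simpler.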
jezrael

Your issue isn't with how to use apply or loc. The issue is that your df is flagged as a copy of another dataframe.

Let's explore this a bit

df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))
df

   A  IP
0  x   1
1  y   2
2  z   3

def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df

   A  IP Country
0  x   1       A
1  y   2       B
2  z   3       C

No problems so far.
Let's make some problems.

# This should make a copy
print(bool(df.is_copy))
df = df[['A', 'IP']]
print(df)
print(bool(df.is_copy))

False
   A  IP
0  x   1
1  y   2
2  z   3
True

Perfect, now we have a copy. Let's perform the same assignment with the apply:

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

   A  IP Country
0  x   1       A
1  y   2       B
2  z   3       C


How do you fix it?
Wherever you created df, use df.loc. In my example above, the slice df = df[['A', 'IP']] triggered the copy. If I had used loc instead, I'd have avoided this mess.

print(bool(df.is_copy))
df = df.loc[:]
print(df)
print(bool(df.is_copy))

False
   A  IP
0  x   1
1  y   2
2  z   3
False

You need to either find where df is created and use loc or iloc instead when you slice the source dataframe. Or, you can simply do this...

df.is_copy = None
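Note that `is_copy` was deprecated and later removed in newer pandas versions; there, the usual way to avoid the warning is an explicit copy when slicing. A minimal sketch:

# An explicit .copy() makes the new frame own its data, so the later
# column assignment no longer triggers SettingWithCopyWarning.
df = df[['A', 'IP']].copy()
df['Country'] = df.IP.apply(find_country_from_connection_ip)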

The full demonstration

df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df = df[:]

df.is_copy = None

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df

   A  IP Country
0  x   1       A
1  y   2       B
2  z   3       C

piRSquared
  • Thanks, nicely explained. But after using `d2 = d1.loc[d1['Country'] == '$$']`, `bool(d2.is_copy)` still evaluates to `True`. And just 100 rows take about `5 seconds`, which means `400000` rows will take about `6 hours` to complete. Quite a long time, isn't it? – Ahsanul Haque Oct 24 '16 at 09:38
  • Only way to make that faster is to let us see the ip conversion code to see if we can vectorize it in any way. – piRSquared Oct 24 '16 at 12:39

First and foremost, @MaxU's answer is the way to go: efficient and ideal for vectorized application on a pd.Series/DataFrame.

Below I contrast the performance of two popular libraries for returning location data given IP address info. TL;DR: use the geolite2 method.


1. geolite2 module from the maxminddb-geolite2 package

Input

# !pip install maxminddb-geolite2
import time
import numpy as np   # needed for np.nan below
from geolite2 import geolite2
geo = geolite2.reader()
df_1 = train_data.loc[:50, ['IP_Address']]

def IP_info_1(ip):
    try:
        x = geo.get(ip)
    except ValueError:   # faulty IP value
        return np.nan
    try:
        return x['country']['names']['en'] if x is not None else np.nan
    except KeyError:   # missing country key
        return np.nan


s_time = time.time()
# map IP --> country
# apply(fn) applies fn to every element of the pd.Series
df_1['country'] = df_1.loc[:,'IP_Address'].apply(IP_info_1)
print(df_1.head(), '\n')
print('Time:',str(time.time()-s_time)+'s \n')

print(type(geo.get('48.151.136.76')))

Output

       IP_Address         country
0   48.151.136.76   United States
1    94.9.145.169  United Kingdom
2   58.94.157.121           Japan
3  193.187.41.186         Austria
4   125.96.20.172           China 

Time: 0.09906983375549316s 

<class 'dict'>

2. DbIpCity module from the ip2geotools package

Input

# !pip install ip2geotools
import time
import numpy as np   # needed for np.nan below

s_time = time.time()
from ip2geotools.databases.noncommercial import DbIpCity
df_2 = train_data.loc[:50, ['IP_Address']]

def IP_info_2(ip):
    try:
        return DbIpCity.get(ip, api_key='free').country
    except:   # any failed lookup
        return np.nan

df_2['country'] = df_2.loc[:, 'IP_Address'].apply(IP_info_2)
print(df_2.head())
print('Time:',str(time.time()-s_time)+'s')

print(type(DbIpCity.get('48.151.136.76',api_key = 'free')))

Output

       IP_Address country
0   48.151.136.76      US
1    94.9.145.169      GB
2   58.94.157.121      JP
3  193.187.41.186      AT
4   125.96.20.172      CN

Time: 80.53318452835083s 

<class 'ip2geotools.models.IpLocation'>

A reason for the huge time difference could be the data structure of the output: direct subsetting of a dictionary seems far more efficient than attribute access on the specialized ip2geotools.models.IpLocation object. Note also that the noncommercial DbIpCity lookup queries a remote service for each IP, while geolite2 reads a local database file, which likely dominates the gap.

Also, the output of the 1st method is a dictionary containing geo-location data; subset it accordingly to obtain the needed info:

x = geolite2.reader().get('48.151.136.76')
print(x)

>>>
    {'city': {'geoname_id': 5101798, 'names': {'de': 'Newark', 'en': 'Newark', 'es': 'Newark', 'fr': 'Newark', 'ja': 'ニューアーク', 'pt-BR': 'Newark', 'ru': 'Ньюарк'}},

 'continent': {'code': 'NA', 'geoname_id': 6255149, 'names': {'de': 'Nordamerika', 'en': 'North America', 'es': 'Norteamérica', 'fr': 'Amérique du Nord', 'ja': '北アメリカ', 'pt-BR': 'América do Norte', 'ru': 'Северная Америка', 'zh-CN': '北美洲'}}, 

'country': {'geoname_id': 6252001, 'iso_code': 'US', 'names': {'de': 'USA', 'en': 'United States', 'es': 'Estados Unidos', 'fr': 'États-Unis', 'ja': 'アメリカ合衆国', 'pt-BR': 'Estados Unidos', 'ru': 'США', 'zh-CN': '美国'}}, 

'location': {'accuracy_radius': 1000, 'latitude': 40.7355, 'longitude': -74.1741, 'metro_code': 501, 'time_zone': 'America/New_York'}, 

'postal': {'code': '07102'}, 

'registered_country': {'geoname_id': 6252001, 'iso_code': 'US', 'names': {'de': 'USA', 'en': 'United States', 'es': 'Estados Unidos', 'fr': 'États-Unis', 'ja': 'アメリカ合衆国', 'pt-BR': 'Estados Unidos', 'ru': 'США', 'zh-CN': '美国'}}, 

'subdivisions': [{'geoname_id': 5101760, 'iso_code': 'NJ', 'names': {'en': 'New Jersey', 'es': 'Nueva Jersey', 'fr': 'New Jersey', 'ja': 'ニュージャージー州', 'pt-BR': 'Nova Jérsia', 'ru': 'Нью-Джерси', 'zh-CN': '新泽西州'}}]}
Sumax

I passed a dataframe with an IP-address column through the code below; there were about 300k rows in the df. This took about 20 seconds.

import pandas as pd
import numpy as np   # pd.np was removed in pandas 1.0, so use numpy directly
from geolite2 import geolite2

def get_country(row, ip):
    try:
        x = geo.get(row[ip])
    except ValueError:
        return np.nan
    try:
        return x['country']['names']['en'] if x else np.nan
    except KeyError:
        return np.nan

geo = geolite2.reader()

# map IP --> country
df_test['login_ip_country'] = df_test.apply(lambda row: get_country(row,'login_ip_address'), axis = 1)
df_test['registered_ip_country'] = df_test.apply(lambda row: get_country(row,'registered_ip_address'), axis = 1)

geolite2.close()

df_test.head()

You don't need to make it a Series; just pass the 'row' into your function, which acts like the 'df'.
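A possible simplification along the lines of the higher-voted answers (a sketch, reusing the geo reader above): since the lookup only needs the IP value, a column-wise Series.apply avoids constructing a row object per record:

def get_country_from_ip(ip):
    # Same logic as get_country above, but takes the IP value directly.
    try:
        result = geo.get(ip)
        return result['country']['names']['en'] if result else np.nan
    except (ValueError, KeyError):   # malformed IP or missing country key
        return np.nan

df_test['login_ip_country'] = df_test['login_ip_address'].apply(get_country_from_ip)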

max