I have a dataframe of around 100k rows with postcodes and country codes. I would like to get the latitude and longitude of each location and save it in two new columns. I have a working code on a sample of the dataframe (e.g. 100 rows), but running it on all dataframe takes so long (>1hour). I am new to Python and suspect, there should be a faster way to do so in terms of:
- For a given
postcode
andcountry_code
, I am querying twice, once for latitude and once for longitude. I am pretty convinced that I should not do that (i.e. I can make a single query per line and create latitude and longitude columns accordingly). - The way I define the function
get_lat(pcode, country)
andget_long(pcode, country)
and apply on the dataframe is not efficient.
An example of my code is below.
import pgeocode
import numpy as np
import pandas as pd
#Sample data
df = pd.DataFrame({'postcode':['3011','3083','3071','2660','9308','9999'], 'country_code': ['NL','NL','NL','BE','BE','DE']})
#There are blank postcodes and postcodes that pgeocode cannot return any value, so I am using try-except (e.g. last row in sample dataframe):
#function to get latitude
def get_lat(pcode, country):
try:
nomi = pgeocode.Nominatim(country)
x = nomi.query_postal_code(pcode).latitude
return x
except:
return np.NaN
#function to get longitude
def get_long(pcode, country):
try:
nomi = pgeocode.Nominatim(country)
x = nomi.query_postal_code(pcode).longitude
return x
except:
return np.NaN
#Find and create new columns for latitude-longitude based on postcode (ex: 5625) and country of postcode (ex: NL)
df['latitude'] = np.vectorize(get_lat)(df['postcode'],df['country_code'])
df['longitude'] = np.vectorize(get_long)(df['postcode'],df['country_code'])