4

I have a dataframe of around 100k rows with postcodes and country codes. I would like to get the latitude and longitude of each location and save it in two new columns. I have a working code on a sample of the dataframe (e.g. 100 rows), but running it on all dataframe takes so long (>1hour). I am new to Python and suspect, there should be a faster way to do so in terms of:

  1. For a given postcode and country_code, I am querying twice, once for latitude and once for longitude. I am pretty convinced that I should not do that (i.e. I can make a single query per line and create latitude and longitude columns accordingly).
  2. The way I define the function get_lat(pcode, country) and get_long(pcode, country) and apply on the dataframe is not efficient.

An example of my code is below.

import pgeocode
import numpy as np
import pandas as pd

#Sample data
df = pd.DataFrame({'postcode':['3011','3083','3071','2660','9308','9999'], 'country_code': ['NL','NL','NL','BE','BE','DE']})

#There are blank postcodes and postcodes that pgeocode cannot return any value, so I am using try-except (e.g. last row in sample dataframe): 
#function to get latitude 
def get_lat(pcode, country):
    try:
        nomi = pgeocode.Nominatim(country)
        x = nomi.query_postal_code(pcode).latitude
        return x
    except:
        return np.NaN

#function to get longitude
def get_long(pcode, country):
    try:
        nomi = pgeocode.Nominatim(country)
        x = nomi.query_postal_code(pcode).longitude
        return x
    except:
        return np.NaN

#Find and create new columns for latitude-longitude based on postcode (ex: 5625) and country of postcode (ex: NL)
df['latitude'] = np.vectorize(get_lat)(df['postcode'],df['country_code'])
df['longitude'] = np.vectorize(get_long)(df['postcode'],df['country_code'])
volkan g
  • 186
  • 10

1 Answers1

1

As an alternative solution, I downloaded the txt files from this website: http://download.geonames.org/export/zip/

After downloading the files, it is simply a matter of importing the txt file and joining. It is much faster but static, i.e. you use a snapshot of the postcode database at an earlier time.

Another advantage is that you can check the files and inspect the format of the postcodes. While using pgeocode, it is harder to keep track of the accepted postcode format and understand why queries return null.

volkan g
  • 186
  • 10