4

I have a csv of about 100 million logs. Where one of the column is address and I am trying to get latitude and longitude of the address. I want to try something like mentioned in the Solution , But the solution given is arcGIS and that is a commercial tool. I did try google API that has limit of 2000 entries only.

What is next best alternative to get address's Lat & Long into the large dataset.

Input: The column Site is the address from the City Paris

start_time,stop_time,duration,input_octets,output_octets,os,browser,device,langue,site
2016-08-27T16:15:00+05:30,2016-08-27T16:28:00+05:30,721.0,69979.0,48638.0,iOS,CFNetwork,iOS-Device,zh_CN,NULL
2016-08-27T16:16:00+05:30,2016-08-27T16:30:00+05:30,835.0,2528858.0,247541.0,iOS,Mobile Safari UIWebView,iPhone,en_GB,Berges de Seine Rive Gauche - Gros Caillou
2016-08-27T16:16:00+05:30,2016-08-27T16:47:00+05:30,1805.0,133303549.0,4304680.0,Android,Android,Samsung GT-N7100,fr_FR,Centre d'Accueil Kellermann
2016-08-27T16:17:00+05:30,,2702.0,32499482.0,7396904.0,Other,Apache-HttpClient,Other,NULL,Bibliothèque Saint Fargeau
2016-08-27T16:17:00+05:30,2016-08-27T17:07:00+05:30,2966.0,39208187.0,1856761.0,iOS,Mobile Safari UIWebView,iPad,fr_FR,NULL
2016-08-27T16:18:00+05:30,,2400.0,1505716.0,342726.0,NULL,NULL,NULL,NULL,NULL
2016-08-27T16:18:00+05:30,,302.0,3424123.0,208827.0,Android,Chrome Mobile,Samsung SGH-I337M,fr_CA,Square Jean Xxiii
2016-08-27T16:19:00+05:30,,1500.0,35035181.0,1913667.0,iOS,Mobile Safari UIWebView,iPhone,fr_FR,Parc Monceau 1 (Entrée)
2016-08-27T16:19:00+05:30,,6301.0,9227174.0,5681273.0,Mac OS X,AppleMail,Other,fr_FR,Bibliothèque Parmentier

The address with NULL can be neglected and also can be removed from the output.

The output should have following columns

start_time,stop_time,duration,input_octets,output_octets,os,browser,device,langue,site, latitude, longitude

Appreciate all the help, Thank you in advance!!

Community
  • 1
  • 1
Sitz Blogz
  • 1,061
  • 6
  • 30
  • 54
  • 1
    You need to provide a [MCVE]. – IanS May 19 '17 at 08:27
  • @InaS I did mention that I worked with Google Api and that has limitation, next best suitable solution I found was with arcgis and that is commercial package. Hence i have no proper way to put up a code, if I put the Google API code then the whole question will go in different direction.. I have tried that already few days ago in another question.. – Sitz Blogz May 19 '17 at 08:33
  • Just to clarify, you're mainly asking for a way to get lat/lon coordinates for a large data set and less about how to manipulate the data with pandas/python. Is that right? – elPastor May 19 '17 at 09:57
  • @pshep I am open to manipulation and also getting lat lon – Sitz Blogz May 19 '17 at 10:08
  • 1
    I would suggest to use geopy with the Open Street Map Nominatim geocoder: https://geopy.readthedocs.io/en/1.10.0/#geopy.geocoders.Nominatim – Below the Radar May 19 '17 at 12:04
  • @BelowtheRadar Thank you for the suggestion .. Do u think you can help me with some code please .. – Sitz Blogz May 19 '17 at 12:06
  • This question would *still* be a lot better if it could be reduced to a MRE like @IanS recommended ... if the OP has time to do so please consider it, for the benefit of SO in general. Thanks! – Max von Hippel Aug 14 '19 at 23:44

2 Answers2

4
import csv
from geopy.geocoders import Nominatim

#if your sites are located in France only you can use the country_bias parameters to restrict search
geolocator = Nominatim(country_bias="France")

with open('c:/temp/input.csv', 'rb') as csvinput:
    with open('c:/temp/output.csv', 'wb') as csvoutput:
       output_fieldnames = ['Site', 'Address_found', 'Latitude', 'Longitude']
       writer = csv.DictWriter(csvoutput, delimiter=';', fieldnames=output_fieldnames)
       writer.writeheader()
       reader = csv.DictReader(csvinput)
       for row in reader:
            site = row['site']
            if site != "NULL":
                try:
                    location = geolocator.geocode(site)
                    address = location.address
                    latitude = location.latitude
                    longitude = location.longitude
                except:
                    address = 'Not found'
                    latitude = 'N/A'
                    longitude = 'N/A'
            else:
                address = 'N/A'
                latitude = 'N/A'
                longitude = 'N/A'

            #here is the writing section
            output_row = {}
            output_row['Site'] = row['site']
            output_row['Address_found'] = address.encode("utf-8")
            output_row['Latitude'] = latitude
            output_row['Longitude'] = longitude
            writer.writerow(output_row)
Below the Radar
  • 7,321
  • 11
  • 63
  • 142
0

This is now baked directly into Geopandas: https://geopandas.org/en/stable/docs/user_guide/geocoding.html

Taken directly from their docs:

import geodatasets

boros = geopandas.read_file(geodatasets.get_path("nybb"))

boros.BoroName
Out[3]: 
0    Staten Island
1           Queens
2         Brooklyn
3        Manhattan
4            Bronx
Name: BoroName, dtype: object

boro_locations = geopandas.tools.geocode(boros.BoroName)

boro_locations
Out[5]: 
                     geometry                                           address
0  POINT (-74.14960 40.58346)  Staten Island, New York, New York, United States
1  POINT (-73.82831 40.71351)         Queens, New York, New York, United States
2  POINT (-73.94972 40.65260)       Brooklyn, New York, New York, United States
3  POINT (-73.95989 40.78962)      Manhattan, New York, New York, United States
4  POINT (-73.87859 40.84665)      The Bronx, New York, New York, United States
Ben
  • 12,614
  • 4
  • 37
  • 69