Question First
How can I search through my SQLite database as quickly as possible?
Should I parse the address data from all 60,000 rows in Excel, load those into a list, then just search for all of them at once?
Switching from simply looking through a plain text file sped up my script by 3 times, but I still think it could go faster.
Thank you in advance!
The Database
I have a SQLite database of city names, their postal code, coordinates, etc. that I created from Geonames' postal codes data dump: Geonames Postal Codes
The database has a table for for each country (DE,US,GB,etc.. 72 in all), and each of those tables has between a a couple dozen to tens of thousands of lines each, in the following format:
country code : iso country code, 2 characters
postal code : varchar(20)
place name : varchar(180)
admin name1 : 1. order subdivision (state) varchar(100)
admin code1 : 1. order subdivision (state) varchar(20)
admin name2 : 2. order subdivision (county/province) varchar(100)
admin code2 : 2. order subdivision (county/province) varchar(20)
admin name3 : 3. order subdivision (community) varchar(100)
admin code3 : 3. order subdivision (community) varchar(20)
latitude : estimated latitude (wgs84)
longitude : estimated longitude (wgs84)
accuracy : accuracy of lat/lng from 1=estimated to 6=centroid
Workflow
Now my current script in Python goes as follows:
- Read row in Excel file
- Parse address and location data (a lot of other, unrelated stuff in between)
- Search for a match in SQLite database
- Write the information from matching row in SQLite db to a .CSV file
The Excel file is about 60,000 rows and goes through my entire Python script(above process) for each row.
My address data is very inconsistent, containing a mix of postal codes, city names, and country names. Sometimes all of this data is in the Excel row, sometimes not. And it also comes with many misspellings and alternate names.
So because the data is so inconsistent, and because sometimes people put down postal codes and cities that don't match, I currently have my Python script try a bunch of different search queries, like:
- Check if [postal code] matches column exactly AND [place name] matches column exactly
- Check if [postal code] matches column exactly AND [place name] in column
- Check if [postal code] matches column exactly AND [place name](split up by word) in column
- Check if just[postal code] matches column
Python Script
Here is the section of the Python script. As you can see, it seems pretty inefficient:
if has_country_code == True:
not_in_list = False
country = country_code.lower()+"_"
print "HAS COUNTRY"
if has_zipcode == True and has_city_name == True:
print "HAS COUNTRY2"
success = False
try:
curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name = ? COLLATE NOCASE", (zipcode, city,))
for row in curs:
success = True
break
except:
not_in_list = True
success = True
if success != True:
curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name LIKE ? COLLATE NOCASE", (zipcode,"%"+city+"%",))
for row in curs:
success = True
break
if success != True:
newCity = ""
newCity = filter(None,re.split('[; / ( ) - ,]',city))
questionMarks = ",".join(["?" for w in newCity])
curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name IN ("+questionMarks+") COLLATE NOCASE", ([zipcode]+newCity))
for row in curs:
success = True
break
if success != True:
curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? COLLATE NOCASE", (zipcode,))
for row in curs:
success = True
break
if success != True:
curs = conn.execute("SELECT * FROM "+country+" WHERE place_name = ? COLLATE NOCASE", (city,))
for row in curs:
success = True
break
if success != True:
curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE ? COLLATE NOCASE", ("%"+city+"%",))
for row in curs:
success = True
break
if success != True:
newCity = ""
newCity = filter(None,re.split('[; / ( ) - ,]',city))
questionMarks = ",".join(["?" for w in newCity])
curs = conn.execute("SELECT * FROM "+country+" WHERE place_name IN ("+questionMarks+") COLLATE NOCASE", (newCity))
for row in curs:
success = True
break
if success != True:
newCity = ""
newCity = filter(None,re.split('[; / ( ) - ,]',city))
newCity.sort(key=len, reverse=True)
newCity = (["%"+w+"%" for w in newCity])
for item in newCity:
curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE (?) COLLATE NOCASE", (item,))
for row in curs:
success = True
break
break
if has_city_name == True and has_zipcode == False:
try:
curs = conn.execute("SELECT * FROM "+country+" WHERE place_name = ? COLLATE NOCASE", (city,))
for row in curs:
success = True
break
except:
not_in_list = True
success = True
if success != True:
curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE ? COLLATE NOCASE", ("%"+city+"%",))
for row in curs:
success = True
break
if success != True:
newCity = ""
newCity = filter(None,re.split('[; / ( ) - ,]',city))
questionMarks = ",".join(["?" for w in newCity])
curs = conn.execute("SELECT * FROM "+country+" WHERE place_name IN ("+questionMarks+") COLLATE NOCASE", (newCity))
for row in curs:
success = True
break
if success != True:
newCity = ""
newCity = filter(None,re.split('[; / ( ) - ,]',city))
newCity.sort(key=len, reverse=True)
newCity = (["%"+w+"%" for w in newCity])
for item in newCity:
curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE (?) COLLATE NOCASE", (item,))
for row in curs:
success = True
break
break
if has_city_name == False and has_zipcode == True:
try:
curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ?", (zipcode,))
for row in curs:
success = True
break
except:
not_in_list = True
success = True