
Question First

How can I search through my SQLite database as quickly as possible?

Should I parse the address data from all 60,000 rows in Excel, load those into a list, then just search for all of them at once?

Switching to SQLite from simply searching through a plain text file already sped up my script by a factor of 3, but I still think it could go faster.

Thank you in advance!


The Database

I have a SQLite database of city names, their postal code, coordinates, etc. that I created from Geonames' postal codes data dump: Geonames Postal Codes

The database has a table for each country (DE, US, GB, etc.; 72 in all), and each of those tables has between a couple dozen and tens of thousands of rows, in the following format:

country code      : iso country code, 2 characters
postal code       : varchar(20)
place name        : varchar(180)
admin name1       : 1. order subdivision (state) varchar(100)
admin code1       : 1. order subdivision (state) varchar(20)
admin name2       : 2. order subdivision (county/province) varchar(100)
admin code2       : 2. order subdivision (county/province) varchar(20)
admin name3       : 3. order subdivision (community) varchar(100)
admin code3       : 3. order subdivision (community) varchar(20)
latitude          : estimated latitude (wgs84)
longitude         : estimated longitude (wgs84)
accuracy          : accuracy of lat/lng from 1=estimated to 6=centroid
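
In case it helps, a rough sketch of each per-country table (my script below builds the table names as the lowercased country code plus "_", e.g. "de_"; the file name and column types here are approximate, not the real DDL):

import sqlite3

conn = sqlite3.connect("geonames.sqlite")  # illustrative file name

conn.execute("""
    CREATE TABLE IF NOT EXISTS de_ (
        country_code TEXT,               -- iso country code, 2 characters
        postal_code  TEXT,
        place_name   TEXT,
        admin_name1  TEXT, admin_code1 TEXT,
        admin_name2  TEXT, admin_code2 TEXT,
        admin_name3  TEXT, admin_code3 TEXT,
        latitude     REAL, longitude REAL,
        accuracy     INTEGER             -- 1=estimated to 6=centroid
    )
""")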

Workflow

Now my current script in Python goes as follows:

  • Read row in Excel file
  • Parse address and location data (a lot of other, unrelated stuff in between)
  • Search for a match in SQLite database
  • Write the information from matching row in SQLite db to a .CSV file

The Excel file is about 60,000 rows, and each row goes through my entire Python script (the process above).

My address data is very inconsistent, containing a mix of postal codes, city names, and country names. Sometimes all of this data is present in an Excel row, sometimes not, and it comes with many misspellings and alternate names.

So because the data is so inconsistent, and because sometimes people put down postal codes and cities that don't match, I currently have my Python script try a bunch of different search queries, like:

  • Check if [postal code] matches column exactly AND [place name] matches column exactly
  • Check if [postal code] matches column exactly AND [place name] in column
  • Check if [postal code] matches column exactly AND [place name](split up by word) in column
  • Check if just [postal code] matches column

Python Script

Here is the relevant section of the Python script. As you can see, it is pretty inefficient:

if has_country_code == True:
    not_in_list = False
    country = country_code.lower()+"_"  # per-country table name, e.g. "de_"
    print "HAS COUNTRY"
    if has_zipcode == True and has_city_name == True:
        print "HAS COUNTRY2"
        success = False

        try:
            # 1) exact postal code AND exact place name
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name = ? COLLATE NOCASE", (zipcode, city,))

            for row in curs:
                success = True
                break   
        except:
            not_in_list = True
            success = True

        if success != True:
            # 2) exact postal code, place name as substring
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name LIKE ? COLLATE NOCASE", (zipcode,"%"+city+"%",))

            for row in curs:
                success = True
                break

        if success != True:
            # 3) exact postal code, any single word of the place name
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            questionMarks = ",".join(["?" for w in newCity])

            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? AND place_name IN ("+questionMarks+") COLLATE NOCASE", ([zipcode]+newCity))

            for row in curs:
                success = True
                break

        if success != True:
            # 4) postal code only
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ? COLLATE NOCASE", (zipcode,))

            for row in curs:
                success = True
                break

        if success != True:
            # 5) exact place name only
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name = ? COLLATE NOCASE", (city,))

            for row in curs:
                success = True
                break   

        if success != True:
            # 6) place name as substring
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE ? COLLATE NOCASE", ("%"+city+"%",))

            for row in curs:
                success = True
                break

        if success != True:
            # 7) any single word of the place name
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            questionMarks = ",".join(["?" for w in newCity])

            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name IN ("+questionMarks+") COLLATE NOCASE", (newCity))

            for row in curs:
                success = True
                break

        if success != True:
            # 8) longest word as a substring (note: the unconditional
            # break below means only the first, longest word is tried)
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            newCity.sort(key=len, reverse=True)
            newCity = (["%"+w+"%" for w in newCity])

            for item in newCity:
                curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE (?) COLLATE NOCASE", (item,))

                for row in curs:
                    success = True
                    break
                break

    if has_city_name == True and has_zipcode == False:
        success = False
        try:
            # exact place name only
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name = ? COLLATE NOCASE", (city,))

            for row in curs:
                success = True
                break   
        except:
            not_in_list = True
            success = True

        if success != True:
            # place name as substring
            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE ? COLLATE NOCASE", ("%"+city+"%",))

            for row in curs:
                success = True
                break

        if success != True:
            # any single word of the place name
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            questionMarks = ",".join(["?" for w in newCity])

            curs = conn.execute("SELECT * FROM "+country+" WHERE place_name IN ("+questionMarks+") COLLATE NOCASE", (newCity))

            for row in curs:
                success = True
                break

        if success != True:
            # longest word as a substring (again, only the first,
            # longest word is tried because of the unconditional break)
            newCity = filter(None,re.split('[; / ( ) - ,]',city))
            newCity.sort(key=len, reverse=True)
            newCity = (["%"+w+"%" for w in newCity])

            for item in newCity:
                curs = conn.execute("SELECT * FROM "+country+" WHERE place_name LIKE (?) COLLATE NOCASE", (item,))

                for row in curs:
                    success = True
                    break
                break

    if has_city_name == False and has_zipcode == True:
        success = False
        try:
            # postal code only
            curs = conn.execute("SELECT * FROM "+country+" WHERE postal_code = ?", (zipcode,))

            for row in curs:
                success = True
                break

        except:
            not_in_list = True
            success = True
yannikrock

2 Answers


This may be one of those cases where you need to try different approaches and see whether each is fast enough. As @Philip suggested, indexes would be a good starting point: if you don't already have an index on at least the postal code, adding one should significantly improve performance.
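
For example, something along these lines (a sketch, not taken from the question; the database file name is an assumption, and the table/column names follow the question's script):

import sqlite3

conn = sqlite3.connect("geonames.sqlite")  # file name is an assumption

# One composite index per country table; indexing place_name with
# COLLATE NOCASE lets the case-insensitive comparisons use it too.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
for (table,) in tables:
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_%s ON %s "
        "(postal_code, place_name COLLATE NOCASE)" % (table, table))
conn.commit()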

If you already have this, or want to try for further gains, I would consider loading the Excel data into your SQLite database and doing the matching as one big query (it will need a full table scan of everything because of the number of matches you are trying to get, but doing this once may not be too bad).
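
A sketch of that idea (the scratch table "excel_input", its columns, and the sample rows are illustrative names, not from the question):

import sqlite3

conn = sqlite3.connect("geonames.sqlite")  # file name is an assumption
parsed_rows = [(1, "DE", "10115", "Berlin")]  # illustrative sample rows

# Bulk-insert the parsed Excel rows into a scratch table once...
conn.execute("CREATE TEMP TABLE excel_input "
             "(row_id INTEGER, country TEXT, zipcode TEXT, city TEXT)")
conn.executemany("INSERT INTO excel_input VALUES (?, ?, ?, ?)", parsed_rows)

# ...then one join per country table replaces several queries per row.
matches = conn.execute("""
    SELECT e.row_id, g.*
    FROM excel_input AS e
    JOIN de_ AS g
      ON g.postal_code = e.zipcode
     AND g.place_name = e.city COLLATE NOCASE
    WHERE e.country = 'DE'
""").fetchall()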

If this does not get the results you want, or it proves too difficult to get the query right, you could try loading all of the SQLite data into Python and building it into dictionaries keyed by what you need to look up: e.g. one level of dictionary for the countries, within each country all of the postal codes, and within each postal code a list of all the records for that country/postal code.
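
A minimal sketch of that structure (the per-country table names are the ones the question's script builds, e.g. "de_"; the file name is an assumption):

import sqlite3
from collections import defaultdict

conn = sqlite3.connect("geonames.sqlite")  # file name is an assumption

# lookup[table][postal_code] -> list of rows, built once up front;
# each per-row search afterwards is two O(1) dictionary lookups.
lookup = {}
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
for (table,) in tables:
    by_zip = defaultdict(list)
    for row in conn.execute("SELECT * FROM %s" % table):
        by_zip[row[1]].append(row)   # row[1] is the postal_code column
    lookup[table] = by_zip

# e.g. candidates = lookup.get("de_", {}).get("10115", [])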

Basically, the theme of this is: make sure your lookups go against keyed structures (database indexes, a Python dictionary, etc.), and if you do scan record by record, don't do it once for every record in your other dataset.

ChrisProsser

For misspelling detection, you can look at phonetic algorithms (though each is designed for one particular language), because the LIKE '%city%' queries in particular will remain inefficient even with indexes.
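
For illustration, a simplified (American/English) Soundex; the general technique would be to precompute a code for every place name, store it in an extra indexed column, and compare codes instead of running LIKE '%city%'. This is a sketch of the algorithm, not a drop-in fix:

def soundex(word):
    # Simplified American Soundex: similar-sounding English names map
    # to the same 4-character code (e.g. Robert and Rupert -> "R163").
    mapping = {c: d for letters, d in
               [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                ("l", "4"), ("mn", "5"), ("r", "6")]
               for c in letters}
    word = word.lower()
    if not word:
        return ""
    code = word[0].upper()
    prev = mapping.get(word[0], "")
    for ch in word[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":   # h and w don't separate duplicate digits
            prev = digit
    return (code + "000")[:4]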

Additionally, you can try reordering the Excel data by country (first ask for DE, then US, ...), so SQLite can "concentrate" on one table and doesn't have to switch all the time; moreover, the prepared-statement cache of Python's SQLite wrapper works more efficiently.

EDIT:

A prepared statement is a previously parsed and analyzed SQL statement. Executing it multiple times is more efficient than creating and preparing a new statement every time. Python's SQLite wrapper caches some prepared statements and reuses them if exactly the same SQL string is used again.
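
In code, this just means keeping the SQL string identical between calls; because the table name is part of the string, grouping the Excel rows by country keeps hitting the same cached statement (a sketch; the file name and "de_rows" are illustrative):

import sqlite3

conn = sqlite3.connect("geonames.sqlite")  # file name is an assumption
de_rows = [("10115", "Berlin")]            # the parsed rows for Germany

sql = ("SELECT * FROM de_ WHERE postal_code = ? "
       "AND place_name = ? COLLATE NOCASE")
for zipcode, city in de_rows:
    curs = conn.execute(sql, (zipcode, city))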

Michael Butscher
  • I once tried using this method as a spell checker: `if difflib.SequenceMatcher(None, country.lower(), line[1].lower()).ratio() > .90:` but it was terribly slow. Simple, but slow. Is there no way to do fuzzy matching with SQLite? I googled "prepared statement cache" and didn't really understand what it was / how to implement it. Is it on by default? Or do I need to change something in my code? – yannikrock Aug 09 '13 at 10:44
  • @rockyan An efficient fuzzy search is complicated. Things like the soundex key are simpler (but language-specific). I have edited my answer regarding prepared statements. – Michael Butscher Aug 09 '13 at 18:07