1

Existing Database and Desired Outcome:

I have a large SQLite database (12 GB, with tables of 44+ million rows) that I would like to modify using pandas in Python 3.

Example objective: I want to read one of these large tables (44 million rows) into a DataFrame in chunks, manipulate each chunk, and write the result to a new table. If possible, I would like to replace the new table if it already exists and append each chunk to it.

Because my manipulations only add or modify columns, the new table should have the same number of rows as the original table.
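In rough sketch form, the pattern I have in mind looks something like the following (a hypothetical outline only; the full, reproducible example is further below, and switching if_exists from "replace" on the first chunk to "append" afterwards is just one way I imagine the replace-then-append behaviour could work):

import pandas as pd
import sqlite3

db = sqlite3.connect("test.db")

#Stream the large source table in chunks rather than loading it all at once
chunks = pd.read_sql_query("select * from test_table;", con=db, chunksize=5000)

for i, df in enumerate(chunks):
    #... add or modify columns here ...
    #Replace the target table on the first chunk, append afterwards
    mode = "replace" if i == 0 else "append"
    df.to_sql("new_table", con=db, if_exists=mode, index=False)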

Issues:

The primary issue seems to stem from the following line in the code below:

df.to_sql(new_table, con=db, if_exists = "append", index=False)

  1. When this line is run in the code below, I consistently get one additional chunk of size N, plus one extra row, beyond what I expect.
  2. The first time this code runs with a new table name, I get an error:
 Traceback (most recent call last):
  File "example.py", line 23, in <module>
    for df in df_generator:
  File "/usr/local/lib/python3.5/site-packages/pandas/io/sql.py", line 1420, in _query_iterator
    data = cursor.fetchmany(chunksize)
sqlite3.OperationalError: SQL logic error or missing database
  3. If I then rerun the script with the same new table name, it runs for each expected chunk, plus an extra chunk and one extra row.

  4. When the df.to_sql() line is commented out, the loop runs for the expected number of chunks.

Test Example of Issue with Complete Code:

Complete Code: example.py

import pandas as pd
import sqlite3

#Helper Functions Used in Example
def ren(invar, outvar, df):
    df.rename(columns={invar:outvar}, inplace=True)
    return(df)

def count_result(c, table):
    for r in c.execute("SELECT COUNT(*) FROM {};".format(table)):
        print("[*] total: {:,} rows in {} table".format(r[0], table))


#Connect to Data
db = sqlite3.connect("test.db")
c = db.cursor()
new_table = "new_table"

#Load Data in Chunks
df_generator = pd.read_sql_query("select * from test_table limit 10000;", con=db, chunksize = 5000)

for df in df_generator:
    #Functions to modify data, example
    df = ren("name", "renamed_name", df)
    print(df.shape)
    df.to_sql(new_table, con=db, if_exists = "append", index=False)


#Count if new table is created
try:
    count_result(c, new_table)
except:
    pass

1. Result when the problem line, df.to_sql(new_table, con=db, if_exists = "append", index=False), is commented out:

$ python3 example.py 
(5000, 22)
(5000, 22)

This is what I expect, since the example code limits my large table to 10k rows.

2. Result when the problem line, df.to_sql(new_table, con=db, if_exists = "append", index=False), is not commented out and the code is run for the first time with a new_table:

$ python3 example.py 
(5000, 22)
Traceback (most recent call last):
  File "example.py", line 23, in <module>
    for df in df_generator:
  File "/usr/local/lib/python3.5/site-packages/pandas/io/sql.py", line 1420, in _query_iterator
    data = cursor.fetchmany(chunksize)
sqlite3.OperationalError: SQL logic error or missing database

3. Result when the problem line, df.to_sql(new_table, con=db, if_exists = "append", index=False), is not commented out and the code is run a second time with the same new_table:

$ python3 example.py 
(5000, 22)
(5000, 22)
(5000, 22)
(1, 22)
[*] total: 20,001 rows in new_table table

Thus, I have two problems: first, the code breaks the first time it is run (Result 2); second, when it is run a second time (Result 3), the total number of rows is more than double what I expected.

Any suggestions on how I can solve this issue would be greatly appreciated.

  • Do you simply want to rename a column(s) in an SQLite table? – MaxU - stand with Ukraine Feb 28 '18 at 22:56
  • @MaxU No, the rename function is simply an example modification. I want to do a number of complex operations across multiple columns that would be easier to do in Pandas than in SQL. – data-cruncher524 Feb 28 '18 at 23:05
  • Try this: `db = sqlite3.connect("test.db", isolation_level=None)` – MaxU - stand with Ukraine Feb 28 '18 at 23:09
  • @MaxU, excellent, that works the first time and produces the expected result! I do notice that whereas the initial (failing) code would run almost instantaneously on the 2nd attempt, with your solution the code has a lag, almost as if there were a `time.sleep(8)` for every chunk. Is there a way to speed this up, or should I just increase the chunk size as memory on my machine allows? – data-cruncher524 Feb 28 '18 at 23:18

3 Answers

1

You may try to specify:

db = sqlite3.connect("test.db", isolation_level=None)
#  ---->                        ^^^^^^^^^^^^^^^^^^^^

Besides that, you may try to increase your chunksize, because otherwise the time between commits is way too short for the SQLite DB - that is causing this error, I guess... I would also recommend using PostgreSQL, MySQL/MariaDB or something similar - they are much more reliable and more appropriate for a database of this size...
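For example, both suggestions combined look roughly like this (a minimal sketch based on the code from the question; the chunksize value is only illustrative):

import pandas as pd
import sqlite3

#Autocommit mode, so each chunk is written without an open transaction
db = sqlite3.connect("test.db", isolation_level=None)

#A larger chunksize means fewer, bigger writes
df_generator = pd.read_sql_query("select * from test_table;", con=db, chunksize=50000)

for df in df_generator:
    df = df.rename(columns={"name": "renamed_name"})
    df.to_sql("new_table", con=db, if_exists="append", index=False)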

MaxU - stand with Ukraine
1

Time Delay in Above Solution

@MaxU's solution of adding isolation_level=None to the database connection is short and sweet. For whatever reason, however, it slowed writing/committing each chunk to the database dramatically. For example, when I tested the solution on a table of 12 million rows, the code took over 6 hours to complete. Conversely, building the original table from several text files took only a few minutes.

This insight led to a faster but less elegant solution, which took less than 7 minutes to complete on a table of 12 million rows versus over 6 hours. The output rows matched the input rows, solving the problem in my original question.

Faster but less Elegant Solution

Since I originally constructed the table from text/csv files, using SQL scripts to load the data, I combined that approach with pandas' chunking capabilities. The essential steps are as follows:

  1. Connect to the db
  2. Use a SQL script to create a new table (columns and order should match whatever you do to the pandas df)
  3. Read the massive table in chunks
  4. For each chunk, modify the df as desired, write to csv, load csv using sql, and commit the change.

Main code of solution:

import pandas as pd
import sqlite3

#Note I Used Functions I Wrote in build_db.py
#(shown below after example solution)
from build_db import *


#Helper Functions Used in Example
def lower_var(var, df):
    s = df[var].str.lower()
    df = df.drop(var, axis=1)
    df = pd.concat([df, s], axis=1)
    return(df)


#Connect to Data
db = sqlite3.connect("test.db")
c = db.cursor()

#create statement
create_table(c, "create_test.sql", path='sql_clean/')

#Load Data in Chunks
df_generator = pd.read_sql_query("select * from example_table;", con=db, chunksize = 100000)

for df in df_generator:
    #functions to modify data, example
    df = lower_var("name", df) #changes column order

    #restore df to column order in sql table
    db_order = ["cmte_id", "amndt_ind", "rpt_tp", "transaction_pgi", "image_num", "transaction_tp", \
        "entity_tp", "name", "city", "state", "zip_code", "employer", "occupation", "transaction_dt", \
        "transaction_amt", "other_id", "tran_id", "file_num", "memo_cd", "memo_text", "sub_id"]
    df = df[db_order]

    #write chunk to csv
    file = "df_chunk.csv"
    df.to_csv(file, sep='|', header=None, index=False)

    #insert chunk csv to db
    insert_file_into_table(c, "insert_test.sql", file, '|', path='sql_clean/')
    db.commit()


#Count results
count_result(c, "test_indiv")

Imported User Functions for Above Code

#Relevant Functions in build_db.py
#(build_db.py also defines sed_replace_null, used below)

import csv
import sqlite3
import subprocess

def count_result(c, table):
    for r in c.execute("SELECT COUNT(*) FROM {};".format(table)):
        print("[*] total: {:,} rows in {} table".format(r[0], table))

def create_table(cursor, sql_script, path='sql/'):
    print("[*] create table with {}{}".format(path, sql_script))
    qry = open("{}{}".format(path, sql_script), 'rU').read()
    cursor.executescript(qry)


def insert_file_into_table(cursor, sql_script, file, sep=',', path='sql/'):
    print("[*] inserting {} into table with {}{}".format(file, path, sql_script))
    qry = open("{}{}".format(path, sql_script), 'rU').read()
    fileObj = open(file, 'rU', encoding='latin-1')
    csvReader = csv.reader(fileObj, delimiter=sep, quotechar='"')

    try:
        for row in csvReader:
            try:
                cursor.execute(qry, row)
            except sqlite3.IntegrityError as e:
                pass

    except Exception as e:
        print("[*] error while processing file: {}, error code: {}".format(file, e))
        print("[*] sed replacing null bytes in file: {}".format(file))
        sed_replace_null(file, "clean_null.sh")
        subprocess.call("bash clean_null.sh", shell=True)

        try:
            print("[*] inserting {} into table with {}{}".format(file, path, sql_script))
            fileObj = open(file, 'rU', encoding='latin-1')
            csvReader = csv.reader(fileObj, delimiter=sep, quotechar='"')
            for row in csvReader:
                try:
                    cursor.execute(qry, row)
                except sqlite3.IntegrityError as e:
                    print(e)

        except Exception as e:
            print("[*] error while processing file: {}, error code: {}".format(file, e))

SQL User Scripts

--create_test.sql

DROP TABLE if exists test_indiv;

CREATE TABLE test_indiv (
    cmte_id TEXT NOT NULL,
    amndt_ind TEXT,
    rpt_tp TEXT,
    transaction_pgi TEXT,
    image_num TEXT,
    transaction_tp TEXT,
    entity_tp TEXT,
    name TEXT,
    city TEXT,
    state TEXT,
    zip_code TEXT,
    employer TEXT,
    occupation TEXT,
    transaction_dt TEXT,
    transaction_amt TEXT,
    other_id TEXT,
    tran_id TEXT,
    file_num NUMERIC,
    memo_cd TEXT,
    memo_text TEXT,
    sub_id NUMERIC NOT NULL
);

CREATE UNIQUE INDEX idx_test_indiv ON test_indiv (sub_id);

--insert_test.sql

INSERT INTO test_indiv (
    cmte_id,
    amndt_ind,
    rpt_tp,
    transaction_pgi,
    image_num,
    transaction_tp,
    entity_tp,
    name,
    city,
    state,
    zip_code,
    employer,
    occupation,
    transaction_dt,
    transaction_amt,
    other_id,
    tran_id,
    file_num,
    memo_cd,
    memo_text,
    sub_id
    ) 
VALUES (
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?,
    ?
);
  • A caveat to the above is that the phenomenon of extra inserts from the chunking is not actually resolved, but the total number of rows in the final table is the same, *if* the create sql script executed has a statement to create a unique index. If there is not a unique index, the code is problematic. – data-cruncher524 Mar 02 '18 at 17:51
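A minimal toy illustration of this caveat (a hypothetical in-memory example, not from the answer above): the unique index turns duplicate inserts into sqlite3.IntegrityError, which insert_file_into_table catches and skips, so re-inserted chunks do not inflate the row count.

import sqlite3

db = sqlite3.connect(":memory:")
c = db.cursor()
c.execute("CREATE TABLE t (sub_id INTEGER)")
c.execute("CREATE UNIQUE INDEX idx_t ON t (sub_id)")
for sub_id in [1, 2, 2]:   #the repeated 2 mimics a re-inserted chunk
    try:
        c.execute("INSERT INTO t (sub_id) VALUES (?)", (sub_id,))
    except sqlite3.IntegrityError:
        pass   #duplicate silently skipped
print(c.execute("SELECT COUNT(*) FROM t").fetchone()[0])   #prints 2, not 3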
0

I experienced the exact same problem (dealing with > 30 GB of data). Here is how I tackled it: instead of using the chunksize feature of read_sql, I decided to create a manual chunk looper like so:

chunksize = chunk_size
offset = 0
for _ in range(0, a_big_number):
    query = "SELECT * FROM the_table LIMIT %s OFFSET %s" % (chunksize, offset)
    df = pd.read_sql(query, conn)
    if len(df) != 0:
        ....                   #process the chunk here
        offset += chunksize    #advance to the next chunk
    else:
        break
Greg