
I'm trying to do something that may not be possible, or that maybe should be done in a different way...

I have to read a 1 GB Access file and manipulate it in pandas; since cursor.fetchall() failed straight away with a MemoryError, I tried the function below in order to see when the memory error happens: it appears after about 400,000 rows fetched (the total is 1.12 million rows).

This is strange, since my machine has 8 GB of RAM and about 50% of it appears to be free. I also set my virtual memory to 16 GB, but the result didn't change.

I don't need computation speed, so any dirty solution is welcome :) including using the hard disk as RAM (I have an SSD).

Is there maybe a way to make all of the memory available to Python?

Ways that have already failed:

  • single row fetch: cursor.fetchone()
  • many rows fetch: cursor.fetchmany()
  • all rows fetch: cursor.fetchall()
  • pandas read_sql passing chunksize: pandas.read_sql(query, conn, chunksize=chunksize) (thanks to user MaxU)

The function:

import pypyodbc
import pandas


def msaccess_to_df(abs_path, query):
    # open an ODBC connection to the Access database
    conn = pypyodbc.connect(
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
        r"Dbq=" + abs_path + ";" )

    cur = conn.cursor()
    cur.execute(query)

    # column names come from the cursor description
    fields = zip(*cur.description)[0]
    df = pandas.DataFrame(columns=fields)

    fetch_lines_per_block = 5000
    i = 0
    while True:
        rows = cur.fetchmany(fetch_lines_per_block) # <-----
        if len(rows) == 0:
            break
        # convert each row tuple into a dict keyed by column name
        rd = [dict(zip(fields, r)) for r in rows]
        df = df.append(rd, ignore_index=True)
        del rows
        del rd
        i += 1
        print 'fetched', i * fetch_lines_per_block, 'lines'

    cur.close()
    conn.close()

    return df

THE ERROR:

df = df.append(rd, ignore_index=True)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 4338, in append
    verify_integrity=verify_integrity)
  File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 845, in concat
    copy=copy)
  File "C:\Python27\lib\site-packages\pandas\tools\merge.py", line 904, in __init__
    obj.consolidate(inplace=True)
  File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 2747, in consolidate
    self._consolidate_inplace()
  File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 2729, in _consolidate_inplace
    self._protect_consolidate(f)
  File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 2718, in _protect_consolidate
    result = f()
  File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 2727, in f
    self._data = self._data.consolidate()
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 3273, in consolidate
    bm._consolidate_inplace()
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 3278, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 4269, in _consolidate
    _can_consolidate=_can_consolidate)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 4292, in _merge_blocks
    new_values = new_values[argsort]
MemoryError

#################### EDIT - SOLVED ####################

Finally I solved it by switching to 64-bit Python (the 32-bit interpreter runs out of address space long before the physical RAM is used up).

With this, any of the methods above works.
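
For anyone checking their own setup, a quick way to see whether the running interpreter is a 32-bit or a 64-bit build (a 32-bit process on Windows only gets roughly 2 GB of address space, no matter how much RAM is installed) is the short sketch below:

import platform
import struct

# pointer size in bits: 32 on a 32-bit build, 64 on a 64-bit build
print struct.calcsize("P") * 8
# e.g. ('64bit', 'WindowsPE') for 64-bit CPython on Windows
print platform.architecture()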

DPColombotto
    Ah! Just about to recommend Python 64-bit if your machine allows it. Please post your edit as answer for future readers. – Parfait Jul 05 '16 at 18:10

1 Answer


I would use the native pandas method read_sql() instead of fetching rows manually in a loop:

import pypyodbc
import pandas as pd


def msaccess_to_df(abs_path, query):
    conn = pypyodbc.connect(
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
        r"Dbq=" + abs_path + ";" )

    # let pandas fetch the whole result set and build the DataFrame in one go
    df = pd.read_sql(query, conn)
    conn.close()
    return df

If you are still getting a MemoryError exception, try reading your data in chunks:

import pypyodbc
import pandas as pd


def msaccess_to_df(abs_path, query, chunksize=10**5):
    conn = pypyodbc.connect(
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
        r"Dbq=" + abs_path + ";" )

    # with chunksize set, read_sql() yields DataFrames of at most
    # `chunksize` rows; concatenate them into a single DataFrame
    df = pd.concat(pd.read_sql(query, conn, chunksize=chunksize),
                   ignore_index=True)
    conn.close()
    return df
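
For example, called with a hypothetical database path and table name (placeholders, adjust to your own file):

df = msaccess_to_df(r"C:\data\mydb.accdb", "SELECT * FROM MyTable", chunksize=50000)
print df.shape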

PS: this should give you an idea, but please be aware that I didn't test this code, so it might need some debugging...
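
Since the question explicitly allows using the SSD, one more untested sketch along the same lines: stream each chunk straight to a CSV file (the output path here is just a placeholder) so the full result never has to sit in memory at once; the dump can later be reloaded piecewise with pandas.read_csv and its own chunksize.

import pypyodbc
import pandas as pd


def msaccess_to_csv(abs_path, query, out_csv, chunksize=10**5):
    # out_csv is a hypothetical output path, e.g. r"C:\temp\dump.csv";
    # it should not exist yet, since chunks are appended to it
    conn = pypyodbc.connect(
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
        r"Dbq=" + abs_path + ";" )

    header = True
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        # append each chunk; write the column header only for the first one
        chunk.to_csv(out_csv, mode='a', header=header, index=False)
        header = False

    conn.close()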

MaxU - stand with Ukraine
  • thank you, but also with `chunksize` I still get a memory error after a while, even with 3 GB of RAM free (viewed in the Windows Resource Monitor). – DPColombotto Jul 05 '16 at 07:58