
I'm trying to export a large result set from Netezza (using the Netezza ODBC driver with pyodbc) to a CSV file. The solution below throws a MemoryError, and if I loop over the cursor without "list" it's VERY slow. Do you have any idea of an intermediate solution that doesn't kill my server/Python process but runs faster?

cursorNZ.execute(sql)
archi = open("c:\test.csv", "w")
lista = list(cursorNZ.fetchall())
for fila in lista:
    registro = ''
    for campo in fila:
        campo = str(campo)
        registro = registro+str(campo)+";"
    registro = registro[:-1]
    registro = registro.replace('None','NULL')
    registro = registro.replace("'NULL'","NULL")
    archi.write(registro+"\n")

---- Edit ----

Thank you. I'm trying this, where "sql" is the query and "cursorNZ" is:

connNZ = pyodbc.connect(DRIVER=.....)
cursorNZ = connNZ.cursor()

chunk = 10 ** 5  # tweak this
chunks = pandas.read_sql(sql, cursorNZ, chunksize=chunk)
with open('C:/test.csv', 'a') as output:
    for n, df in enumerate(chunks):
        write_header = n == 0
        df.to_csv(output, sep=';', header=write_header, na_rep='NULL')

I get this: AttributeError: 'pyodbc.Cursor' object has no attribute 'cursor'. Any idea?

  • Possible duplicate of http://stackoverflow.com/questions/17707264/iterating-over-pyodbc-result-without-fetchall especially the reference to [fetchmany](http://code.google.com/p/pyodbc/wiki/Cursor#fetchmany). – tdelaney Oct 05 '16 at 15:45
  • Pass `read_sql` your connection instead. I'll edit my answer to reflect this. – Paul H Oct 05 '16 at 17:15

1 Answer


Don't use cursorNZ.fetchall().

Instead, loop through the cursor directly:

with open("c:/test.csv", "w") as archi:  # note the fixed '/'
    cursorNZ.execute(sql)
    for fila in cursorNZ:
        registro = ''
        for campo in fila:
            campo = str(campo)
            registro = registro+str(campo)+";"
        registro = registro[:-1]
        registro = registro.replace('None','NULL')
        registro = registro.replace("'NULL'","NULL")
        archi.write(registro+"\n")
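
As tdelaney's comment on the question points out, cursor.fetchmany() is a middle ground between fetchall() (everything in memory at once) and processing one row per fetch call. A rough sketch along those lines, assuming the same sql and cursorNZ as above (the batch size of 10000 is arbitrary and worth tuning):

cursorNZ.execute(sql)
with open("c:/test.csv", "w") as archi:
    while True:
        filas = cursorNZ.fetchmany(10000)  # pull a batch of rows, not the whole result set
        if not filas:
            break  # no rows left
        for fila in filas:
            # write NULL for None values, which is what the .replace() calls above approximate
            registro = ";".join("NULL" if campo is None else str(campo) for campo in fila)
            archi.write(registro + "\n")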

Personally, I would just use pandas:

import pyodbc
import pandas

cnn = pyodbc.connect(DRIVER=.....)
chunksize = 10 ** 5  # tweak this
chunks = pandas.read_sql(sql, cnn, chunksize=chunksize)

with open('C:/test.csv', 'a') as output:
    for n, df in enumerate(chunks):
        write_header = n == 0
        df.to_csv(output, sep=';', header=write_header, na_rep='NULL')
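
Note that read_sql is given the connection object (cnn), not a cursor. Passing the cursor, as in the edit to the question, is what raises AttributeError: 'pyodbc.Cursor' object has no attribute 'cursor', because pandas calls .cursor() on whatever object it is handed.
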
    The `pandas` solution would have a large memory footprint. – tdelaney Oct 05 '16 at 15:47
  • @tdelaney `read_sql` can take a `chunksize` argument that could help manage that. – Paul H Oct 05 '16 at 15:48
  • But you still have a 2gb+ dataframe in memory. There is a speed advantage to doing it in `pandas`, of course, so maybe some outer loop that builds smaller frames. – tdelaney Oct 05 '16 at 15:56
  • @tdelaney I think the latest edit addresses your concerns too – Paul H Oct 05 '16 at 16:00
  • It works and it's MUCH faster than the Windows client for exporting data. There was a mismatch here: **chunksize** = 10 ** 5 vs. the variable used in the next line (I used "chunk" in both). Thank you! – Alejandro Oct 05 '16 at 18:33