I'm trying to read fixed-width files and bulk load the data into various tables on a PostgreSQL server, one file per table.
I've already written a function that creates each table; I just need to populate it.
After reading the docs and some further research, I found two promising options: copy_from from psycopg2 and .to_sql from pandas. I have a working implementation using the latter, and the resulting table looks great. The only problem is that it takes about 14 minutes to upload a file of 110,000 rows and 100 columns.
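For context, the working-but-slow version looks roughly like this (a sketch rather than my exact code; the SQLAlchemy engine and the read_fwf arguments stand in for mine):

```
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('connectionstring')  # placeholder connection string

# read the whole fixed-width file into memory, then let pandas emit the INSERTs
df = pd.read_fwf(data_path, widths=widths, index_col=False, header=None, names=cols)
df.to_sql(table, engine, if_exists='append', index=False)
```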
copy_from is reportedly much faster; I just can't get it to work yet. Below is the code I have so far:
```
import pandas as pd
import psycopg2 as pg

def sql_gen(cur, conn, filename, widths, col_names, table, header=False, chunk_size=10**5):
    # read the fixed-width file lazily, one chunk at a time
    df = pd.read_fwf(filename, widths=widths, index_col=False, header=None,
                     iterator=True, chunksize=chunk_size, names=col_names)
    for chunk in df:
        cur.copy_from(chunk, table, null="")
        yield conn.commit()
# parameters
data_path = 'data.dat'
table = 'example_table'

# helper results (extraction code omitted) giving the widths and column names
widths = getwidths
cols = getcols
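# for illustration only, these end up as plain Python lists along these lines
# (hypothetical values; my real ones are extracted elsewhere):
#   widths = [10, 8, 8, 12]
#   cols = ['id', 'date', 'code', 'amount']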
# main part of the script
conn = None
try:
    conn = pg.connect('connectionstring')
    cursor = conn.cursor()
    for sql in sql_gen(cursor, conn, data_path, widths, cols, table):
        print(sql)
    # close communication with the PostgreSQL database server
    cursor.close()
except (Exception, pg.Error) as error:
    print(error)
finally:
    # closing the database connection
    if conn is not None:
        conn.close()
        print("PostgreSQL conn is closed")
```
I expected this to work, but instead I'm getting a TypeError:

```
argument 1 must have both .read() and .readline() methods
```
Full traceback as requested:
File "<ipython-input-10-542d72b61dd4>", line 4, in <module>
for sql in sql_gen(cursor,conn,data_path,widths,cols,table):
File "<ipython-input-8-f82fb5831db3>", line 7, in sql_generator
cur.copy_from(df,'example_table',null="")```
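From the error, my guess is that copy_from wants a file-like object (something exposing .read() and .readline()), not a DataFrame. Is serializing each chunk into an in-memory buffer the right direction? A minimal sketch of what I mean (untested):

```
import io

# inside sql_gen, replacing the current loop body: write each chunk to an
# in-memory text buffer so copy_from receives the file-like object it expects
for chunk in df:
    buf = io.StringIO()
    chunk.to_csv(buf, sep='\t', header=False, index=False)
    buf.seek(0)  # rewind so copy_from reads from the start
    cur.copy_from(buf, table, sep='\t', null="")
    yield conn.commit()
```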