
I am using the code below to read a tab-delimited file into a Postgres database:

import pandas as pd
import smart_open
from sqlalchemy import create_engine

# cfg, server, key, and table_name are defined elsewhere
enginestring = cfg.dbuser + ":" + cfg.dbpwd + "@" + server.local_bind_host + ":" + str(server.local_bind_port) + "/" + cfg.dbname
engine = create_engine('postgresql://' + enginestring)

rows = []
for line in smart_open.smart_open(key):
    ln = line.decode('utf-8')  # decode() already returns str
    fields = ln.split('\t')
    rows.append(fields)

df = pd.DataFrame(rows, columns=cfg.df_colheaders)
print(df.head)
df.to_sql(name=table_name, con=engine, if_exists='append')

The call to print shows the DataFrame I expect (i.e. [798624 rows x 133 columns]), and the call to to_sql doesn't fail, yet in the DB I see only one row of data, with the correct columns. (The results are the same whether or not the table already exists.)

gkennos
  • I believe it's not a connection issue because multiple tables are created by this loop without reconnecting. Could it be a data issue? – gkennos Feb 17 '18 at 04:44
  • Weird. 1. `df.head()` returns something meaningful, correct? 2. What happens when you test your code on a smaller data set, such as `df = df.head()` and then `df.to_sql(name=table_name, con=engine, if_exists='append')`? – FatihAkici Feb 17 '18 at 04:58
  • And please add `, index=False)` into `df.to_sql()` to see if that plays a role. – FatihAkici Feb 17 '18 at 05:06
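
For reference, the smaller-data-set test suggested in the comments above boils down to something like this (reusing `df`, `engine`, and `table_name` from the question):

# append only the first five rows, and skip writing the DataFrame index
df.head().to_sql(name=table_name, con=engine, if_exists='append', index=False)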

1 Answer


OK, here is an update:

  • I solved the single-row issue by stripping the EOL characters (I could see a ¶ at the end of the last inserted field).
  • Then I was just getting empty tables, so I added the `chunksize` parameter to `to_sql`. Not sure why it silently did nothing instead of failing, but it's OK now. A sketch of the fixed version follows below.
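
For anyone hitting the same symptoms, here is a minimal sketch of the fixed loop and write, assuming the same `cfg`, `key`, `engine`, and `table_name` as in the question (the `chunksize` value of 10000 is just an example):

rows = []
for line in smart_open.smart_open(key):
    # strip the trailing CR/LF so it doesn't end up inside the last field
    fields = line.decode('utf-8').rstrip('\r\n').split('\t')
    rows.append(fields)

df = pd.DataFrame(rows, columns=cfg.df_colheaders)
# write in batches instead of one giant INSERT
df.to_sql(name=table_name, con=engine, if_exists='append', chunksize=10000)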
gkennos