28

I am loading about 2 - 2.5 million records into a Postgres database every day.

I then read this data with pd.read_sql to turn it into a dataframe, then do some column manipulation and some minor merging. I save this modified data as a separate table for other people to use.

When I do df.to_sql it takes forever. If I save a CSV file and use COPY FROM in Postgres, the whole thing only takes a few minutes, but the server is on a separate machine and it is a pain to transfer files there.

Using psycopg2, it looks like I can use copy_expert to benefit from the bulk copying but still use Python. If possible, I want to avoid writing an actual CSV file. Can I do this in memory with a pandas dataframe?

Here is an example of my pandas code. I would like to add copy_expert or something similar to make saving this data much faster, if possible.

    for date in required_date_range:
        df = pd.read_sql(sql=query, con=pg_engine, params={'x': date})
        # ... do stuff to the columns ...
        df.to_sql('table_name', pg_engine, index=False, if_exists='append', dtype=final_table_dtypes)

Can someone help me with example code? I would prefer to keep using pandas, and it would be nice to do it in memory. If not, I will just write a temporary CSV file and do it that way.

Edit: here is my final code, which works. It only takes a couple of hundred seconds per date (millions of rows) instead of a couple of hours.

to_sql = """COPY %s FROM STDIN WITH CSV HEADER"""

def process_file(conn, table_name, file_object):
    fake_conn = cms_dtypes.pg_engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.copy_expert(sql=to_sql % table_name, file=file_object)
    fake_conn.commit()
    fake_cur.close()


#after doing stuff to the dataframe
    s_buf = io.StringIO()
    df.to_csv(s_buf) 
    process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)
trench
  • I don't know psycopg2 but you could try something like: `s_buf = io.StringIO()`, `df.to_csv(s_buf)`, which will store your df in a file-like buffer. Then maybe `cur.copy_from(s_buf,...)` instead of `copy_expert`. – ptrj Jul 05 '16 at 16:42
  • The StringIO worked! I still kept copy_expert though. It only took like 100 seconds versus 10000 seconds when I was using just plain pandas.to_sql. Make a real answer so I can accept. – trench Jul 05 '16 at 20:14
  • Glad I could help. – ptrj Jul 05 '16 at 22:01

3 Answers

52

The Python module io (docs) has the necessary tools for file-like objects.

import io

# text buffer
s_buf = io.StringIO()

# saving a data frame to a buffer (same as with a regular file):
df.to_csv(s_buf)

Edit. (I forgot) In order to read from the buffer afterwards, its position should be set to the beginning:

s_buf.seek(0)

I'm not familiar with psycopg2, but according to the docs both copy_expert and copy_from can be used, for example:

cur.copy_from(s_buf, table)

(For Python 2, see StringIO.)
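
Putting the pieces together, a minimal end-to-end sketch might look like the following. This is only an illustration under a few assumptions: a SQLAlchemy engine named pg_engine backed by psycopg2, and a target table whose columns already match the DataFrame's columns (copy_from_dataframe is a hypothetical helper name).

import io

def copy_from_dataframe(engine, table_name, df):
    # Write the DataFrame to an in-memory text buffer instead of a CSV on disk
    s_buf = io.StringIO()
    df.to_csv(s_buf, index=False, header=False)
    s_buf.seek(0)  # rewind so COPY reads from the beginning

    # raw_connection() exposes the underlying psycopg2 connection
    conn = engine.raw_connection()
    try:
        with conn.cursor() as cur:
            cur.copy_expert("COPY %s FROM STDIN WITH CSV" % table_name, s_buf)
        conn.commit()
    finally:
        conn.close()

copy_from_dataframe(pg_engine, 'table_name', df)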

ptrj
  • Thanks. Also, I am looping through each date to query for that date. Each time I am connecting to the database again. Is there a better way to not have to connect/reconnect each loop? Like I would connect once, and then I would just change the query during my loops? – trench Jul 06 '16 at 00:16
  • I'm not sure if I understand, but can't you just query for the whole stuff (all dates in the range) at once? If this is too large, then maybe query in chunks. I guess it would add another column with dates to the data frame. Then you could either drop this column if you don't need it, or select and work with subframes, or `groupby` by dates and iterate over groups (see the sketch after these comments). Or, if you want to avoid `pd.read_sql` completely, maybe copy data with `copy_expert`/`copy_to` to a string buffer and load it to a data frame with `pd.read_csv`. It's just off the top of my head. – ptrj Jul 06 '16 at 00:58
  • `AttributeError: '_io.StringIO' object has no attribute 'write_cells' ` – Pyd Aug 30 '18 at 05:35
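
For the date-range follow-up in the comments above, here is a rough sketch of the "query once, then group by date" idea. The query text, the source_table name, and the report_date column are hypothetical and only stand in for whatever the original query selects:

import pandas as pd

# Hypothetical: pull the whole date range in one query, then iterate per-date groups
query = """SELECT * FROM source_table
           WHERE report_date BETWEEN %(start)s AND %(end)s"""
df_all = pd.read_sql(sql=query, con=pg_engine,
                     params={'start': required_date_range[0],
                             'end': required_date_range[-1]})

for date, df in df_all.groupby('report_date'):
    # ... do stuff to the columns ...
    # then COPY each group to Postgres via the in-memory buffer approach above
    ...
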
6

I had problems implementing the solution from ptrj.

I think the issue stems from pandas setting the pos of the buffer to the end.

See as follows:

from StringIO import StringIO  # Python 2; on Python 3 use io.StringIO
import pandas as pd

df = pd.DataFrame({"name": ['foo', 'bar'], "id": [1, 2]})
s_buf = StringIO()
df.to_csv(s_buf)
s_buf.__dict__

# Output
# {'softspace': 0, 'buflist': ['foo,1\n', 'bar,2\n'], 'pos': 12, 'len': 12, 'closed': False, 'buf': ''}

Notice that pos is at 12. I had to set pos to 0 in order for the subsequent copy_from command to work:

s_buf.pos = 0
cur = conn.cursor()
cur.copy_from(s_buf, tablename, sep=',')
conn.commit()
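
For reference, a Python 3 version of the same fix would use io.StringIO plus seek(0) (as noted in the comments, the idiomatic way to reset the position). This is only a sketch, assuming an open psycopg2 connection conn and a table tablename whose columns line up with name and id:

import io
import pandas as pd

df = pd.DataFrame({"name": ['foo', 'bar'], "id": [1, 2]})
s_buf = io.StringIO()
df.to_csv(s_buf, index=False, header=False)  # copy_from expects raw rows, no header
s_buf.seek(0)  # reset the position to the beginning before reading

cur = conn.cursor()
cur.copy_from(s_buf, tablename, sep=',')
conn.commit()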
a_bigbadwolf
  • I ran that code over the weekend and didn't get any errors. When I got to the office on Monday though, my tables were empty, which sucked. I ended up writing temporary csv files, which does work completely. So if I do to_csv(s_buf) and then s_buf.pos=0 then it will work without writing a csv? – trench Jul 19 '16 at 20:48
  • @a_bigbadwolf @trench Good point! It's a standard behaviour of file/stream buffers (i.e. not the fault of pandas). I just forgot to include it, my apologies. An idiomatic way to set the position to the beginning is to run `s_buf.seek(0)`. – ptrj Jul 20 '16 at 00:35
1

The pandas DataFrame.to_csv method (since v1.0) will return a string if no file object is specified. For example:

df = pd.DataFrame([{'x': 1, 'y': 1}, {'x': 2, 'y': 4}, {'x': 3, 'y': 9}])

# outputs to a string
csv_as_string = df.to_csv(index=False)
print(repr(csv_as_string))  # prints 'x,y\r\n1,1\r\n2,4\r\n3,9\r\n' (on windows)

# outputs to a file
with open('example.csv', 'w', newline='') as f:
    df.to_csv(f, index=False)  # writes to file, returns None

From the current (v1.4.3) docs:

path_or_buf : str, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline=’’, disabling universal newlines. If a binary file object is passed, mode might need to contain a ‘b’.
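
For the original question, the returned string can then be wrapped back into an in-memory file object and fed to COPY. A rough sketch, using the df from the example above and assuming an open psycopg2 connection conn and a hypothetical table example_table whose columns match x and y:

import io

csv_as_string = df.to_csv(index=False, header=False)
s_buf = io.StringIO(csv_as_string)  # a freshly constructed StringIO starts at position 0

cur = conn.cursor()
cur.copy_expert("COPY example_table FROM STDIN WITH CSV", s_buf)
conn.commit()
cur.close()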

averykhoo