I am connecting to a DB hosted on Redshift and want to save the table locally as a pickle file (or in any other suitable format) to save time.
Saving 100k records takes 163 MB of space, and the table has ~14 million records in total, so it would take around 165 GB for the entire table.
Meanwhile, one of my colleagues saved the same table as an .rds file through R, and it took only ~500 MB for the entire 14 million records.
How can I save space and make things faster in Python? Why does the .rds format take so little space?
Here's a sample of my code (Python 2.7.11, using cPickle with protocol=cPickle.HIGHEST_PROTOCOL):
import os
import gzip
import cPickle as pickle

def write_pickle(folder, filename, my_obj, overwrite=False):
    filepath = os.path.join(folder, filename)
    # write the gzipped pickle if overwrite is requested or the file doesn't exist yet
    if overwrite or not os.path.isfile(filepath):
        print('Writing into pickle file...')
        with gzip.open(filepath, 'wb') as handle:
            pickle.dump(my_obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

connection_obj = form_connection_obj()
cursor = connection_obj.cursor()
cursor.execute('SELECT TOP 100000 * FROM my_dataset;')
write_pickle('folder', 'filename.p', list(cursor.fetchall()))
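For context, I read the dump back roughly like this (the actual read helper isn't shown above, so this is an assumed counterpart to write_pickle):

def read_pickle(folder, filename):
    # assumed read-side counterpart: decompress and unpickle the dump
    filepath = os.path.join(folder, filename)
    with gzip.open(filepath, 'rb') as handle:
        return pickle.load(handle)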
UPDATE 1: After switching from pickle to cPickle and using HIGHEST_PROTOCOL, the size was reduced by half (~83 MB for 100k records), but it is still very large compared to the R counterpart.
UPDATE 2: After using gzip, the size was reduced further (~17 MB for 100k records), but it is still more than 3 times the size of R's .rds format, and slower to load as well.
This question is a little different from the suggested one since:
1) I am not fixated on using pickle; I just want a local dump of the data so I can avoid a remote DB connection every time.
2) I want to find the best way to do this (it may not even involve dumping the table locally); a rough sketch of the kind of alternative I would also accept follows below.
3) The solution suggested in the other question, using gzip, has made loading the pickle considerably slower.
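For illustration only, here is an untested sketch of the kind of alternative I mean in (2): pulling the query result into a pandas DataFrame (so columns get proper dtypes instead of a list of Python tuples) and dumping that. It assumes pandas is available and reuses the same form_connection_obj() helper from above.

# rough, untested sketch of an alternative dump path (assumes pandas is installed)
import pandas as pd

connection_obj = form_connection_obj()
# pandas infers column dtypes, so numeric columns are stored as compact arrays
# rather than as a list of Python tuples
df = pd.read_sql('SELECT TOP 100000 * FROM my_dataset;', connection_obj)
df.to_pickle('folder/filename.p')  # other on-disk formats (HDF5, compressed CSV) are also options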