I am connecting to a DB hosted on Redshift and want to save the table locally as a pickle file (or in any other suitable format) to save time.
Saving 100k records takes 163 MB of space, and the table has ~14 million records in total, so it would take around 165 GB for the entire table.
Meanwhile, one of my colleagues saved the same table as an .rds file through R, and it took only ~500 MB for the entire 14 million records.
How can I save space and make things faster in Python? Why does the .rds format take so little space?
Here's a sample of my code (Python 2.7.11, using cPickle with protocol=cPickle.HIGHEST_PROTOCOL):
import os
import gzip
import cPickle as pickle

def write_pickle(folder, filename, my_obj, overwrite=False):
    filepath = os.path.join(folder, filename)
    # write the gzipped pickle if overwrite is requested or the file doesn't exist yet
    if overwrite or not os.path.isfile(filepath):
        print('Writing into pickle file...')
        with gzip.open(filepath, 'wb') as handle:
            pickle.dump(my_obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

connection_obj = form_connection_obj()
cursor = connection_obj.cursor()
cursor.execute('SELECT TOP 100000 * FROM my_dataset;')
write_pickle('folder', 'filename.p', list(cursor.fetchall()))
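For context, I read the dump back roughly like this (the actual read helper isn't shown above, so this is an assumed counterpart to write_pickle):

def read_pickle(folder, filename):
    # assumed read-side counterpart: decompress and unpickle the dump
    filepath = os.path.join(folder, filename)
    with gzip.open(filepath, 'rb') as handle:
        return pickle.load(handle)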
UPDATE 1: After switching from pickle to cPickle and using HIGHEST_PROTOCOL, the size was reduced by half (~83 MB for 100k records), but it is still very large compared to the R counterpart.
UPDATE 2: After using gzip, the size was reduced further (~17 MB for 100k records), but it is still more than 3 times the size of R's .rds format, and slower to load as well.
This question is a little different from the suggested one since:
1) I am not fixated on using pickle; I just want a local dump of the data so I can avoid a remote DB connection every time.
2) I want to find the best way to do this (it may not even involve dumping the table locally); a rough sketch of the kind of alternative I would also accept follows below.
3) The solution suggested in the other question, using gzip, has made loading the pickle considerably slower.
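For illustration only, here is an untested sketch of the kind of alternative I mean in (2): pulling the query result into a pandas DataFrame (so columns get proper dtypes instead of a list of Python tuples) and dumping that. It assumes pandas is available and reuses the same form_connection_obj() helper from above.

# rough, untested sketch of an alternative dump path (assumes pandas is installed)
import pandas as pd

connection_obj = form_connection_obj()
# pandas infers column dtypes, so numeric columns are stored as compact arrays
# rather than as a list of Python tuples
df = pd.read_sql('SELECT TOP 100000 * FROM my_dataset;', connection_obj)
df.to_pickle('folder/filename.p')  # other on-disk formats (HDF5, compressed CSV) are also options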