I would like to serialize my DataFrame. The DataFrame uses 10.1 GB of memory and has 59 million entries.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59181090 entries, 0 to 59181089
Data columns (total 22 columns):
(...)
dtypes: float64(1), int64(9), object(12)
memory usage: 10.1+ GB
When I serialize the DataFrame with feather and then re-import the serialized DataFrame, it appears to corrupt the DataFrame.
df.to_feather("raw_df.feather")
unserialized_df = pd.read_feather("raw_df.feather")
unserialized_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22909623 entries, 0 to 22909622
Data columns (total 22 columns):
(...)
dtypes: float64(2), int64(8), object(12)
memory usage: 3.8+ GB
It also introduces a small number of NaN values where there were none before.
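A minimal check along these lines shows the mismatch (the row counts are the ones from the info() output above; the NaN counts are the before/after observation):

len(df)                               # 59181090
len(unserialized_df)                  # 22909623
df.isna().sum().sum()                 # 0 before serialization
unserialized_df.isna().sum().sum()    # small positive count after the round trip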
What's the best way to serialize a large DataFrame?
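For reference, one alternative I have been looking at is Parquet (a sketch, assuming pyarrow is installed; I have not yet verified whether it avoids the row loss):

df.to_parquet("raw_df.parquet")
unserialized_df = pd.read_parquet("raw_df.parquet")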
I'm using an ml.m4.10xlarge AWS SageMaker instance with a JupyterLab interface. I have 30 GB of storage available with only 4 GB used, so I should not be hitting a storage limitation.
I have 160 GiB of main memory, so holding the whole DataFrame in memory should not be a problem either.
I am using Pandas 0.24.2 with Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0].