
I would like to serialize my DataFrame. The DataFrame uses 10.1 GB of memory and has 59 million entries.

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59181090 entries, 0 to 59181089
Data columns (total 22 columns):
(...)
dtypes: float64(1), int64(9), object(12)
memory usage: 10.1+ GB

When I serialize the DataFrame with feather and then re-import it, the data appears to be corrupted.

df.to_feather("raw_df.feather")

unserialized_df = pd.read_feather("raw_df.feather")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22909623 entries, 0 to 22909622
Data columns (total 22 columns):
(...)
dtypes: float64(2), int64(8), object(12)
memory usage: 3.8+ GB

It also introduces a small number of NaN values where there were none before.
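
For reference, a check along these lines (just a sketch, reusing the df and unserialized_df from the snippets above) is how the row-count, dtype, and NaN discrepancies can be quantified:

# Compare the original frame with the round-tripped copy: row counts,
# dtypes, and per-column NaN totals should all be identical.
print(len(df), len(unserialized_df))                # 59181090 vs. 22909623
print((df.dtypes != unserialized_df.dtypes).sum())  # columns whose dtype changed
print(df.isnull().sum().sum())                      # total NaNs before serialization
print(unserialized_df.isnull().sum().sum())         # total NaNs after the round trip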

What's the best way to serialize a large DataFrame?

I'm using an ml.m4.10xlarge AWS SageMaker instance with a JupyterLab interface. I have 30 GB of storage available with 4 GB used, so I should not be affected by a storage limitation.

I have 160 GiB of main memory, so handling the whole DataFrame in memory should not be an issue.

I am using Pandas 0.24.2 with Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0].

2 Answers

Try using dask.

import dask
import dask.dataframe as dd
import pandas as pd

# dask.dataframe has no feather reader, so load the file lazily through
# dask.delayed and build the dask DataFrame from that single partition.
parts = [dask.delayed(pd.read_feather)("raw_df.feather")]
unserialized_df = dd.from_delayed(parts).compute()

Source: Load many feather files in a folder into dask
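
If the frame were first split into several smaller feather files (the scenario in the linked question), the same pattern scales out. A minimal sketch, assuming hypothetical file names part-0.feather, part-1.feather, and so on:

import glob

import dask
import dask.dataframe as dd
import pandas as pd

# One lazy partition per feather file; compute() materializes the combined
# dask DataFrame back into a single pandas DataFrame.
files = sorted(glob.glob("part-*.feather"))
parts = [dask.delayed(pd.read_feather)(f) for f in files]
unserialized_df = dd.from_delayed(parts).compute()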

Another option is to use an efficient binary format for serializing tabular data called BinTableFile (source code on GitHub). It indexes the data within the file, supports efficient reads from the middle of the file by integer index, and is written in Cython for speed.
