I am storing two different pandas DataFrames as parquet files (through kedro).
Both DataFrames have identical dimensions and dtypes (float32) before being written to disk, and their memory consumption in RAM is identical:
distances_1.memory_usage(deep=True).sum()/1e9
# 3.730033604
distances_2.memory_usage(deep=True).sum()/1e9
# 3.730033604
When persisted as .parquet files, the first DataFrame results in a file of ~0.89 GB, while the second results in a file of ~4.5 GB.
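For reference, this is roughly how the on-disk encoding of both files can be inspected (a sketch; the file paths are placeholders for the actual kedro outputs):

```python
import pyarrow.parquet as pq

# Placeholder paths: substitute the real kedro dataset locations.
for path in ["distances_1.parquet", "distances_2.parquet"]:
    meta = pq.ParquetFile(path).metadata
    col = meta.row_group(0).column(0)  # first column chunk of first row group
    print(path)
    print("  codec:       ", col.compression)
    print("  encodings:   ", col.encodings)
    print("  compressed:  ", col.total_compressed_size)
    print("  uncompressed:", col.total_uncompressed_size)
```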
distances_1 has many more redundant values than distances_2, and thus compression might be more effective for it.
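To illustrate what I mean by redundancy, here is a small standalone sketch (synthetic data, not my actual frames) comparing two float32 frames of identical shape and dtype, one highly repetitive and one with near-unique values, written with pandas' default Parquet settings:

```python
import os
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
shape = (1_000_000, 10)
cols = [f"c{i}" for i in range(shape[1])]

# Highly redundant: values drawn from only three distinct float32 numbers.
repetitive = pd.DataFrame(
    rng.choice(np.float32([0.0, 1.5, 3.0]), size=shape), columns=cols
)
# Near-unique: random float32 values.
unique = pd.DataFrame(rng.random(shape, dtype=np.float32), columns=cols)

repetitive.to_parquet("repetitive.parquet")  # default codec (snappy)
unique.to_parquet("unique.parquet")

print(os.path.getsize("repetitive.parquet") / 1e6, "MB")
print(os.path.getsize("unique.parquet") / 1e6, "MB")
```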
Loading the parquet files back from disk yields data that is identical to the original DataFrames.
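The round-trip check looks roughly like this (paths are again placeholders for the kedro dataset locations):

```python
import pandas as pd

reloaded_1 = pd.read_parquet("distances_1.parquet")
reloaded_2 = pd.read_parquet("distances_2.parquet")

# Both reloaded frames match the in-memory originals exactly.
assert reloaded_1.equals(distances_1)
assert reloaded_2.equals(distances_2)
```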
- How can the big size difference between the files be explained?
- For what reasons could the second file be larger than the in-memory data structure?