
I am storing two different pandas DataFrames as parquet files (through kedro).

Both DataFrames have identical dimensions and dtypes (float32) before getting written to disk. Also, their memory consumption in RAM is identical:

distances_1.memory_usage(deep=True).sum()/1e9
# 3.730033604
distances_2.memory_usage(deep=True).sum()/1e9
# 3.730033604

When persisted as .parquet files, the first DataFrame results in a file of ~0.89 GB, while the second results in a file of ~4.5 GB.

distances_1 has many more redundant values than distances_2 and thus compression might be more effective.
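One way to quantify that redundancy is to look at the share of distinct values per column, which is what parquet's dictionary and run-length encodings exploit. A minimal sketch, using the distances_1/distances_2 frames from above:

# Fraction of distinct values; a lower ratio means more redundancy
# and therefore better odds for parquet's dictionary/run-length encodings.
print(distances_1.nunique().sum() / distances_1.size)
print(distances_2.nunique().sum() / distances_2.size)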

Loading the parquet files from disk into DataFrames results in valid data that is identical to the original DataFrames.

  • How can the big size difference between the files be explained?
  • For what reasons could the second file be larger than the in-memory data structure?
Nils Blum-Oeste
  • Wouldn't it be less confusing to translate RAM into usual units? – Wolf Mar 16 '21 at 09:08
  • The code provided returns the total memory consumption of the dataframe in GB, right? I thought that would make it easy to compare it to the file sizes. – Nils Blum-Oeste Mar 16 '21 at 09:16
  • I see, it's only that sometimes (like in Windows Explorer) the unit `1GB` means `2^30 Bytes`. – Wolf Mar 16 '21 at 09:27
  • Is this "many more redundant values" measurable in some way? – Wolf Mar 16 '21 at 09:30
  • Of course you are right about the GB, the division by 1e9 is just an approximation, but I don't think this is crucial to the issue, is it? – Nils Blum-Oeste Mar 16 '21 at 10:03
  • ...just a tiny detail you may be familiar with: differences in presentation style build a distance between things, compare the relation between *sixteen* and *8*, so it's more a psychological issue. Is there anything new about the actual problem? I mean (2) file representation tend to be larger when references need to be stored. The possibility for optimizations (1) sometimes depend on size (a bitset can be stored in a 64-bit machine word at runtime). Python hashes small numbers. UTF-8 is very efficient for 7-bit charsets. BTW: is there a difference in the origin of the two files? – Wolf Mar 16 '21 at 10:51

2 Answers


As you say, the number of unique values can play a very important role in parquet size.

When writing from pandas, two other factors that can have a surprisingly large effect on parquet file size are:

  1. pandas indexes, which are saved by default even if they're just auto-assigned;
  2. the sorting of your data, which can make a large difference in the run-length encoding parquet sometimes uses.

A shuffled, auto-assigned index can take a lot of space. If you don't care about the sort order of the data on disk, paying attention to these two points can make a significant difference.

Consider four cases of a pandas frame with one column containing the same data in all cases: the rounded-down square roots of the first 2**16 integers. Storing it sorted without an index takes 2.9K; sorted with the auto-assigned index, 3.3K; shuffled without the index, 66K; shuffled with the auto-assigned index, 475K.

import pandas as pd
import numpy as np

!mkdir -p /tmp/parquet

# One column: the rounded-down square roots of the first 2**16 integers
d = pd.DataFrame({"A": np.floor(np.sqrt(np.arange(2**16)))})

d.to_parquet("/tmp/parquet/straight.parquet")
d.to_parquet("/tmp/parquet/straight_no_index.parquet", index=False)
d.sample(frac=1).to_parquet("/tmp/parquet/shuf.parquet")
d.sample(frac=1).to_parquet("/tmp/parquet/shuf_no_index.parquet", index=False)

!ls -lSh /tmp/parquet
-rw-r--r--  1 user  wheel   475K Mar 18 13:39 shuf.parquet
-rw-r--r--  1 user  wheel    66K Mar 18 13:39 shuf_no_index.parquet
-rw-r--r--  1 user  wheel   3.3K Mar 18 13:39 straight.parquet
-rw-r--r--  1 user  wheel   2.9K Mar 18 13:39 straight_no_index.parquet
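To see where the bytes actually go, the parquet metadata reports the encodings and compressed/uncompressed sizes per column chunk; a minimal sketch with pyarrow, using the file paths from the example above:

import pyarrow.parquet as pq

for name in ["straight_no_index", "shuf"]:
    meta = pq.ParquetFile(f"/tmp/parquet/{name}.parquet").metadata
    col = meta.row_group(0).column(0)  # first column chunk of the first row group
    print(name, col.encodings, col.compression,
          col.total_compressed_size, col.total_uncompressed_size)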
Ben Schmidt
  • Great information, thanks. Writing without indexes makes a tiny difference in my case (much less than 0.1%) but doesn't explain the observed big differences in file sizes at all. I also need to preserve the order of rows (especially when dropping the indexes). – Nils Blum-Oeste Mar 24 '21 at 16:10
  • Do you have an idea why it might be larger on disk than in memory? – Nils Blum-Oeste Mar 24 '21 at 16:23
  • It's hard to think of any without knowing the shape of the data and the engine. As Wolf said, maybe it's something that Python is hashing internally. A couple of other possibilities: – Ben Schmidt Mar 25 '21 at 19:11
  • (continuing...) Maybe some of the additional data that parquet can store beyond pandas internals (e.g., the per-page index hints) is out of whack. Maybe somehow the encoding is actually losing space, which can happen; or you have table-level information that df.memory_usage(deep=True) isn't measuring. (If your column names, e.g., are each a million characters long, they'll add a lot to file size but not to df.memory_usage(), which doesn't consider them.) – Ben Schmidt Mar 25 '21 at 19:18

From a Kedro point of view this just calls the PyArrow library's write_table function, documented here. Any of these parameters are available via the save_args argument in the catalog definition and may be worth playing around with.
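As a rough illustration (not Kedro-specific), the write_table parameters most likely to matter here are the compression codec and dictionary encoding; a minimal sketch, assuming distances_1 is one of the DataFrames from the question:

import pyarrow as pa
import pyarrow.parquet as pq

# distances_1 is assumed to be one of the DataFrames from the question.
table = pa.Table.from_pandas(distances_1, preserve_index=False)
pq.write_table(
    table,
    "distances_1.parquet",
    compression="zstd",   # compression codec; pyarrow's default is snappy
    use_dictionary=True,  # dictionary-encode repeated values
)

In a Kedro catalog entry, the same keyword arguments would presumably go under save_args.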

datajoely