I have a process that generates millions of small dataframes and saves them to parquet in parallel.
All of the dataframes have the same columns and index information, and the same number of rows (about 300).
Because each dataframe is small, the metadata in each parquet file is quite large compared to the data itself. And since the metadata is essentially identical for every file, disk space is wasted by repeating the same metadata millions of times.
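For a sense of scale, this is roughly how the overhead shows up (the column names and dtypes here are made up for illustration; writing parquet from pandas needs pyarrow or fastparquet installed):

```python
import os
import numpy as np
import pandas as pd

# one of the millions of small frames -- ~300 rows, a couple of columns
df = pd.DataFrame({
    "value": np.random.rand(300),
    "flag": np.random.randint(0, 2, size=300),
})
df.to_parquet("sample.parquet")

raw_bytes = df.memory_usage(deep=True).sum()    # size of the actual data in memory
file_bytes = os.path.getsize("sample.parquet")  # size on disk, including schema/footer
print(f"in-memory data: {raw_bytes} bytes, parquet file: {file_bytes} bytes")
```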
Is it possible to save one copy of the metadata, and have the other parquet files contain only the data? Then, when I need to read a dataframe, I would read the metadata and the data from two different files.
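To make the split I have in mind concrete, here is a rough sketch using numpy and pickle instead of parquet (the helpers save_meta_once, save_data_only and load_frame are just names I made up; ideally the per-frame data files would keep parquet's encoding and compression):

```python
import pickle
import numpy as np
import pandas as pd

def save_meta_once(df: pd.DataFrame, path: str) -> None:
    # one shared file holding columns, dtypes and index -- identical for every frame
    meta = {"columns": df.columns, "dtypes": df.dtypes, "index": df.index}
    with open(path, "wb") as f:
        pickle.dump(meta, f)

def save_data_only(df: pd.DataFrame, path: str) -> None:
    # per-dataframe file holding only the raw values (path should end in .npy)
    np.save(path, df.to_numpy())

def load_frame(meta_path: str, data_path: str) -> pd.DataFrame:
    # reassemble a dataframe from the shared metadata plus one data-only file
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)
    values = np.load(data_path, allow_pickle=True)  # allow_pickle only needed for object columns
    df = pd.DataFrame(values, columns=meta["columns"], index=meta["index"])
    return df.astype(dict(meta["dtypes"]))
```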
Some updates:
Concatenating them into one big dataframe can save the disk space, but it is not friendly to parallel processing of each small dataframe.
I also tried other formats such as Feather, but it seems that Feather does not store the data as efficiently as parquet: the Feather file is smaller than the full parquet file, but still larger than what one shared copy of the parquet metadata plus a data-only parquet file would take.
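For reference, this is roughly how I compared the two formats (to_feather also goes through pyarrow):

```python
import os
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": np.random.rand(300),
    "flag": np.random.randint(0, 2, size=300),
})

df.to_parquet("small.parquet")
df.to_feather("small.feather")

print("parquet:", os.path.getsize("small.parquet"), "bytes")
print("feather:", os.path.getsize("small.feather"), "bytes")
```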