I'm writing Parquet files using the ParquetDatasetStoreWriter
class and the performance I get is really bad.
The flow is normally this:
// First write
dataStoreWriter.write(entity1);
dataStoreWriter.write(entity2);
...
dataStoreWriter.write(entityN);
// Then close
dataStoreWriter.close();
The problem is, as you might know, that my dataStoreWriter
is only a facade: the real writing work is done by a taskExecutor
and a taskScheduler (a rough sketch of this hand-off is shown below, after the log output).
That work is visible in these messages printed to standard output:
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 685B for [localId] BINARY: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 75B for [factTime] INT64: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 50B for [period] INT32: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 6,304B for [objectType] BINARY: 300,000 values, ...
As you can see, I am writing 300K objects per Parquet file, which results in files of around 700K on disk. Nothing really big... However, after one or two writes I see fewer and fewer of these messages and the process stalls...
Any idea what could be happening? Everything looks green in Cloudera...
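For context, this is roughly how the hand-off between the facade and the task executor looks in my setup. It is only a simplified sketch: AsyncWriterFacade and ParquetWriterBackend are hypothetical stand-ins for my actual Spring beans, and the real ParquetDatasetStoreWriter sits behind the backend interface.

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncWriterFacade<T> {

    // Hypothetical backend that does the real ParquetDatasetStoreWriter work.
    public interface ParquetWriterBackend<E> {
        void write(E entity) throws IOException;
        void close() throws IOException;
    }

    private final ExecutorService taskExecutor = Executors.newSingleThreadExecutor();
    private final ParquetWriterBackend<T> backend;

    public AsyncWriterFacade(ParquetWriterBackend<T> backend) {
        this.backend = backend;
    }

    // write() only queues the entity; the actual Parquet write happens later
    // on the executor thread.
    public void write(T entity) {
        taskExecutor.submit(() -> {
            backend.write(entity);
            return null;
        });
    }

    // close() drains the queue before closing the underlying writer.
    public void close() throws IOException, InterruptedException {
        taskExecutor.shutdown();
        taskExecutor.awaitTermination(10, TimeUnit.MINUTES);
        backend.close();
    }
}

The relevant point is that write() returns immediately and the real Parquet write happens later on the executor thread, so a slow writer does not fail fast; it just builds up a backlog, which matches the "fewer and fewer messages" behaviour described above.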
Versions used:
- Cloudera 5.11
- Java 8
- Spring Integration 4.3.12.RELEASE
- Spring Data Hadoop 2.2.0.RELEASE
Edit: Actually, I isolated the writing of the Parquet files using the Kite Dataset CLI tool, and the problem is the performance of the SDK itself. Using the csv-import
command and loading the data from a CSV, I see that we are writing at a rate of about 400,000 records per minute, which is way below the roughly 15,000,000 records per minute that we need to write, hence the stalling...
Can you recommend any way of improving this writing rate? Thanks!