I'm writing Parquet files using the ParquetDatasetStoreWriter
class and the performance I get is really bad.
The flow is normally this:
// First write
dataStoreWriter.write(entity1);
dataStoreWriter.write(entity2);
...
dataStoreWriter.write(entityN);
// Then close
dataStoreWriter.close();
The problem is, as you might know, that my dataStoreWriter
is only a facade: the real writing work is done by a taskExecutor
and a taskScheduler (a rough sketch of this hand-off is shown below, after the log output).
That work is visible in these messages printed to standard output:
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 685B for [localId] BINARY: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 75B for [factTime] INT64: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 50B for [period] INT32: 300,000 values, ...
INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 6,304B for [objectType] BINARY: 300,000 values, ...
As you can see, I am writing 300K objects per Parquet file, which results in files of around 700K on disk. Nothing really big... However, after one or two writes I see fewer and fewer of these messages and the process stalls...
Any idea what could be happening? Everything looks green in Cloudera...
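For context, this is roughly how the hand-off between the facade and the task executor looks in my setup. It is only a simplified sketch: AsyncWriterFacade and ParquetWriterBackend are hypothetical stand-ins for my actual Spring beans, and the real ParquetDatasetStoreWriter sits behind the backend interface.

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncWriterFacade<T> {

    // Hypothetical backend that does the real ParquetDatasetStoreWriter work.
    public interface ParquetWriterBackend<E> {
        void write(E entity) throws IOException;
        void close() throws IOException;
    }

    private final ExecutorService taskExecutor = Executors.newSingleThreadExecutor();
    private final ParquetWriterBackend<T> backend;

    public AsyncWriterFacade(ParquetWriterBackend<T> backend) {
        this.backend = backend;
    }

    // write() only queues the entity; the actual Parquet write happens later
    // on the executor thread.
    public void write(T entity) {
        taskExecutor.submit(() -> {
            backend.write(entity);
            return null;
        });
    }

    // close() drains the queue before closing the underlying writer.
    public void close() throws IOException, InterruptedException {
        taskExecutor.shutdown();
        taskExecutor.awaitTermination(10, TimeUnit.MINUTES);
        backend.close();
    }
}

The relevant point is that write() returns immediately and the real Parquet write happens later on the executor thread, so a slow writer does not fail fast; it just builds up a backlog, which matches the "fewer and fewer messages" behaviour described above.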
Versions used:
- Cloudera 5.11
- Java 8
- Spring Integration 4.3.12.RELEASE
- Spring Data Hadoop 2.2.0.RELEASE
Edit: Actually, I isolated the writing of the Parquet files using the Kite Dataset CLI tool, and the problem is the performance of the SDK itself. Using the csv-import
command and loading the data from a CSV, I see that we are writing at a rate of about 400,000 records per minute, which is way below the roughly 15,000,000 records per minute that we need to write, hence the stalling...
Can you recommend any way of improving this writing rate? Thanks!