
I'm trying to split parquet/snappy files created by Hive INSERT OVERWRITE TABLE... on the dfs.block.size boundary, because Impala issues a warning when a file in a partition is larger than the block size.

Impala logs the following WARNING:

Parquet files should not be split into multiple hdfs-blocks. file=hdfs://<SERVER>/<PATH>/<PARTITION>/000000_0 (1 of 7 similar)

Code:

CREATE TABLE <TABLE_NAME>(<FILEDS>)
PARTITIONED BY (
    year SMALLINT,
    month TINYINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\037'
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY");

As for the INSERT hql script:

SET dfs.block.size=134217728;
SET hive.exec.reducers.bytes.per.reducer=134217728;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=134217728;
SET hive.merge.smallfiles.avgsize=67108864;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=134217728;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
from <ANOTHER_TABLE> where year=<YEAR> and month=<MONTH>;

The issue is that the file sizes are all over the place:

partition 1: 1 file:  size = 163.9 M
partition 2: 2 files: sizes = 207.4 M, 128.0 M
partition 3: 3 files: sizes = 166.3 M, 153.5 M, 162.6 M
partition 4: 3 files: sizes = 151.4 M, 150.7 M, 45.2 M

The issue is the same regardless of the dfs.block.size setting (and the other settings above), whether increased to 256M, 512M or 1G (for different data sets).

Is there a way (or a setting) to make sure that the output parquet/snappy files are split just below the HDFS block size?

Hatim Diab
  • I ended up hacking a solution using pyspark: check the size of the original data and come up with a "good ratio" of compression from the original to parquet/snappy (~1.4 from gz), then n = int(math.ceil(size * 1.4 / hdfs_block_size)) and df.repartition(n).write.parquet(some_path). (That was the solution that worked back in 2015.) – Hatim Diab Mar 31 '18 at 05:05
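The workaround described in the comment above can be sketched in plain Python. The 1.4 growth factor is the empirical gzip-to-parquet/snappy ratio the commenter arrived at, and the 128 MiB block size matches the dfs.block.size used in the question; both are assumptions, not universal constants:

```python
import math

# Estimate how many output files are needed so that each stays just
# under one HDFS block, assuming gzip input grows by ~1.4x when
# rewritten as parquet/snappy.
def partition_count(input_size_bytes, hdfs_block_size=134217728, ratio=1.4):
    return int(math.ceil(input_size_bytes * ratio / hdfs_block_size))

# For example, 1 GiB of gzip input with a 128 MiB block size:
n = partition_count(1 << 30)  # -> 12 output files
# then, in pyspark: df.repartition(n).write.parquet(some_path)
```

The ratio would need to be re-measured for a different input codec or schema; an underestimate puts files back over the block boundary.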

3 Answers


There is no way to close files once they grow to the size of a single HDFS block and start a new file. That would go against how HDFS typically works: having files that span many blocks.

The right solution is for Impala to schedule its tasks where the blocks are local instead of complaining that the file spans more than one block. This was completed recently as IMPALA-1881 and will be released in Impala 2.3.

blue
  • Thanks Ryan, this really works well for columns that are not complex types. Is there any alternative way to use `INSERT INTO...SELECT` in Impala to deal with complex types? Also, if HDFS is configured with a block size of 128MB, then will setting the parquet block size to 256MB make sense? Is getting down to one block per file the ideal scenario? Thanks! – user2727704 Jan 26 '16 at 20:08

You need to set both the parquet block size and the dfs block size:

SET dfs.block.size=134217728;  
SET parquet.block.size=134217728; 

Both need to be set to the same value, because you want each parquet block to fit inside an HDFS block.
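As a rough illustration of why the two sizes should match, a ceiling-division check (the helper name is made up for this sketch; the file sizes are taken from partition 2 in the question) shows how many HDFS blocks each output file spans:

```python
HDFS_BLOCK = 134217728  # 128 MiB, matching dfs.block.size above

def blocks_spanned(file_size_bytes, block_size=HDFS_BLOCK):
    # Ceiling division: a file even one byte over the block size
    # spans two blocks, which is what triggers Impala's warning.
    return -(-file_size_bytes // block_size)

# Partition 2 from the question: 207.4 M spans 2 blocks,
# while 128.0 M fits exactly in 1.
for mb in (207.4, 128.0):
    size = int(mb * 1024 * 1024)
    print(mb, blocks_spanned(size))
```

If the parquet block size exceeds the HDFS block size, a single row group necessarily straddles a block boundary and one of its halves is always remote to the reading node.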

Shahzad Barkati
  • Thank you. I just tried it and it did not work; I'm guessing that's because the final output is parquet/snappy. – Hatim Diab Jun 17 '15 at 18:49
  • What about mapred.max.split.size? I think that also matters. – Tagar Jul 24 '15 at 04:54
  • As of Parquet 1.8.0, the HDFS block size will be automatically set to the row group size (parquet.block.size) if it is smaller than the row group size. That way you shouldn't get this error if you forget to set it. – blue Nov 13 '15 at 23:20

In some cases you can set the parquet block size by setting mapred.max.split.size (parquet 1.4.2+), which you already did. You can set it lower than the HDFS block size to increase parallelism. Parquet tries to align to HDFS blocks when possible:

https://github.com/Parquet/parquet-mr/pull/365

Edit 11/16/2015: According to https://github.com/Parquet/parquet-mr/pull/365#issuecomment-157108975 this might also be IMPALA-1881, which is fixed in Impala 2.3.

Tagar