Convert data from gzip to sequenceFile format using Hive on spark

Question

I'm trying to read a large gzip file into Hive through the Spark runtime and convert it to SequenceFile format, and I want to do this efficiently.

As far as I know, Spark uses only one mapper per gzip file, the same as it does for text files.

Is there a way to change the number of mappers used to read a gzip file, or should I choose another format such as Parquet?

I'm currently stuck. The problem is that my log file is JSON-like data saved as a .txt file and then gzipped, so I read it with org.apache.spark.sql.json.

The examples I have seen of converting data into SequenceFile all use simple delimited input, such as CSV.

I used to execute this query:

CREATE TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');

But now I have to rewrite it as something like this:

CREATE TABLE table_1(
ID BIGINT,
NAME STRING 
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;

LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;


LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;


INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;

INSERT INTO TABLE table_1 SELECT id, name from table_1_text;

Is this the optimal way of doing this, or is there a simpler approach to this problem? Please help!

score 2 · Accepted Answer · edited Jul 21 '17 at 06:45


Because a gzip-compressed text file is not splittable, only one mapper will be launched for it. If you want to use more than one mapper, you have to choose another data format.

If you have huge JSON files and want to save storage on HDFS, compress them with bzip2 instead. bzip2 is splittable, so a single file can be read by multiple mappers, and Hive can query .bz2 JSON files without any other changes.
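
A minimal sketch of the bzip2 route, assuming the recompressed files sit in a hypothetical HDFS directory /data/logs_bz2/ and that each line holds one JSON object with id and name fields (the field names are borrowed from the columns in the question and are not confirmed by it). The table name table_1_bz2 is also made up for illustration. The codec is chosen from the .bz2 file extension, so the table definition says nothing about compression:

```sql
-- Hypothetical location: point it at the directory that holds the .bz2 files.
CREATE EXTERNAL TABLE table_1_bz2 (line STRING)
STORED AS TEXTFILE
LOCATION '/data/logs_bz2/';

-- Each row is one raw JSON line; the fields are extracted at query time.
-- '$.id' and '$.name' are assumed field names.
SELECT CAST(get_json_object(line, '$.id') AS BIGINT) AS id,
       get_json_object(line, '$.name') AS name
FROM table_1_bz2
LIMIT 10;
```

Keeping the raw line in a single STRING column avoids needing a JSON SerDe jar on the classpath, at the cost of parsing the JSON in every query.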

edited Jul 21 '17 at 06:45

David דודו Markovitz


answered Jul 21 '17 at 05:49

user2017


Thank you for your response. But have you also tried CREATE TABLE with ROW FORMAT? – Marcel Mars Jul 21 '17 at 09:22
Could you also confirm my assumptions? As the manual says, bzip2 files are processed with multiple mappers; however, because decompression takes longer, I don't gain any read time when benchmarking Hive query execution on small data sets. – Marcel Mars Jul 21 '17 at 14:26
Yes. Storing the gzip file's data in a table in SequenceFile format would definitely work, as SequenceFile is splittable. I think your second approach will work. You can also refer to https://cwiki.apache.org/confluence/display/Hive/CompressedStorage – user2017 Jul 21 '17 at 16:47
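
For reference, the recipe on that wiki page can be adapted to this question roughly as follows. This is a sketch under assumptions, not a tested solution: the staging table table_1_text holds one raw JSON line per row, the JSON field names id and name are guessed from the question's column list, and whether the two SET properties are honoured may depend on the Spark and Hive versions in use. The gzip file is still read by a single mapper during the INSERT; only later reads of the SequenceFile table can be split.

```sql
-- Staging table: one raw JSON line per row; the gzip codec is applied on read.
CREATE TABLE table_1_text (line STRING) STORED AS TEXTFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1_text;

-- Target table in SequenceFile format.
CREATE TABLE table_1 (id BIGINT, name STRING) STORED AS SEQUENCEFILE;

-- Compress the output per block so the result stays splittable.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

-- Parse the JSON while copying; '$.id' and '$.name' are assumed field names.
INSERT OVERWRITE TABLE table_1
SELECT CAST(get_json_object(line, '$.id') AS BIGINT),
       get_json_object(line, '$.name')
FROM table_1_text;
```

LOAD DATA only moves the file into the table directory without converting it, which is why the conversion has to go through INSERT ... SELECT rather than loading the .gz file straight into the SequenceFile table.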
