I'm on Cloudera 5.16 with Hadoop 2.6.
I use ImportTsv to load large CSV files into HBase:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=';' -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:age mynamespace:mytable /path/to/csv/dir/*.csv
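For reference, ImportTsv also has a bulk-load variant that writes HFiles first and then loads them into the table in a second step. A sketch of what that would look like here, in case it is relevant (the /tmp/hfiles staging path is just a placeholder):

# Step 1: generate HFiles on HDFS instead of issuing puts (staging path is a placeholder)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=';' \
  -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:age \
  -Dimporttsv.bulk.output=/tmp/hfiles \
  mynamespace:mytable /path/to/csv/dir/*.csv

# Step 2: hand the generated HFiles over to the table's regions
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mynamespace:mytable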
My problem is that regardless of file size (I have files with 300k lines and others with 1k lines), the operation takes between 20 and 30 seconds:
19/08/22 15:11:56 INFO mapreduce.Job: Job job_1566288518023_0335 running in uber mode : false
19/08/22 15:11:56 INFO mapreduce.Job: map 0% reduce 0%
19/08/22 15:12:06 INFO mapreduce.Job: map 67% reduce 0%
19/08/22 15:12:08 INFO mapreduce.Job: map 100% reduce 0%
19/08/22 15:12:08 INFO mapreduce.Job: Job job_1566288518023_0335 completed successfully
19/08/22 15:12:08 INFO mapreduce.Job: Counters: 34
File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=801303
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2709617
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        HDFS: Number of bytes read erasure-coded=0
Job Counters
        Launched map tasks=3
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=25662
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=25662
        Total vcore-milliseconds taken by all map tasks=25662
        Total megabyte-milliseconds taken by all map tasks=26277888
Map-Reduce Framework
        Map input records=37635
        Map output records=37635
        Input split bytes=531
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=454
        CPU time spent (ms)=14840
        Physical memory (bytes) snapshot=1287696384
        Virtual memory (bytes) snapshot=8280121344
        Total committed heap usage (bytes)=2418540544
        Peak Map Physical memory (bytes)=439844864
        Peak Map Virtual memory (bytes)=2776657920
ImportTsv
        Bad Lines=0
File Input Format Counters
        Bytes Read=2709086
File Output Format Counters
        Bytes Written=0
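The job runs with "uber mode : false", so each of the three map tasks gets its own YARN container. A sketch of the same import with uber mode requested for these small inputs (mapreduce.job.ubertask.enable is a generic MapReduce property, not an ImportTsv option, and I don't know whether it helps here):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dmapreduce.job.ubertask.enable=true \
  -Dimporttsv.separator=';' \
  -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:age \
  mynamespace:mytable /path/to/csv/dir/*.csv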
I created the table with multiple regions, pre-split on the row key, to spread the puts, but it didn't change anything:
create 'mynamespace:mytable', {NAME => 'data', COMPRESSION => 'SNAPPY'}, {SPLITS => ['0','1','2','3','4','5']}
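For completeness, a sketch of how the region boundaries can be checked from the HBase shell (just a prefix scan of hbase:meta on the table name):

scan 'hbase:meta', {STARTROW => 'mynamespace:mytable,', FILTER => "PrefixFilter('mynamespace:mytable,')", COLUMNS => ['info:regioninfo']}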
Does anyone know how to optimize this operation?
Thanks.