
I'm on Cloudera 5.16 with Hadoop 2.6.

I use ImportTsv to load large CSV files into HBase.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=';' -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:age mynamespace:mytable /path/to/csv/dir/*.csv

My problem is that whatever the size of the files (some have 300k lines, others 1k lines), the operation takes between 20 and 30 seconds.

19/08/22 15:11:56 INFO mapreduce.Job: Job job_1566288518023_0335 running in uber mode : false
19/08/22 15:11:56 INFO mapreduce.Job:  map 0% reduce 0%
19/08/22 15:12:06 INFO mapreduce.Job:  map 67% reduce 0%
19/08/22 15:12:08 INFO mapreduce.Job:  map 100% reduce 0%
19/08/22 15:12:08 INFO mapreduce.Job: Job job_1566288518023_0335 completed successfully
19/08/22 15:12:08 INFO mapreduce.Job: Counters: 34
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=801303
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2709617
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=3
                Data-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=25662
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=25662
                Total vcore-milliseconds taken by all map tasks=25662
                Total megabyte-milliseconds taken by all map tasks=26277888
        Map-Reduce Framework
                Map input records=37635
                Map output records=37635
                Input split bytes=531
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=454
                CPU time spent (ms)=14840
                Physical memory (bytes) snapshot=1287696384
                Virtual memory (bytes) snapshot=8280121344
                Total committed heap usage (bytes)=2418540544
                Peak Map Physical memory (bytes)=439844864
                Peak Map Virtual memory (bytes)=2776657920
        ImportTsv
                Bad Lines=0
        File Input Format Counters
                Bytes Read=2709086
        File Output Format Counters
                Bytes Written=0

I have created multiple regions, split on the key, to distribute the puts, but it didn't change anything.

create 'mynamespace:mytable', {NAME => 'data', COMPRESSION => 'SNAPPY'}, {SPLITS => ['0','1','2','3','4','5']}

Does anyone know how to optimize this operation?

Thanks.

Eric C

1 Answer


I think there are a couple of things that you can do to improve this:

  1. Looking at how you create the table, I can see that you are not pre-defining the number of regions that will serve this specific table. Assuming this is a new table that you are populating, HBase would have to take on the additional load of splitting existing regions, and your imports will take longer.

What I would suggest is to set the number of regions for your table by adding this:

NUMREGIONS => "some reasonable number depending on the size of the initial table"

When I say initial table, I mean accommodating the volume of data that you know you are going to load into it. Data that will be gradually added later does not necessarily need to be accommodated at this point (since you don't want half-empty region processes running).

  2. I am not sure how evenly distributed your keys are; normally an md5 hash of the key is used to give an even distribution of entries between region servers and avoid skew. This might be another point to consider, since you may end up in a situation where a single mapper gets more load than the others, and then the length of the job depends on the execution of that single mapper. So I'd be very careful using pre-splits unless you really know what you are doing. As an alternative to manual pre-splits, I can suggest you use this for your table:

SPLITALGO => 'UniformSplit'
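For example, both suggestions could be combined in a single create statement (the region count and compression here are placeholders; pick a NUMREGIONS based on your expected data volume):

```
create 'mynamespace:mytable', {NAME => 'data', COMPRESSION => 'SNAPPY'}, {NUMREGIONS => 16, SPLITALGO => 'UniformSplit'}
```

With NUMREGIONS plus SPLITALGO, HBase computes the split points itself instead of you supplying them by hand.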

I'd also suggest you google around for more details on the above.
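As a rough sketch of the md5-salting idea, you could preprocess the CSV before running ImportTsv so that each row key gets a short hash prefix (the file names are placeholders, and the two-character salt width is just an assumption for illustration):

```shell
# Hypothetical preprocessing step: prefix each row key with the first two hex
# characters of its md5, so keys spread evenly across pre-split regions
# instead of clustering on natural key prefixes.
printf 'john;25\nanna;31\n' > input.csv   # tiny sample standing in for the real CSV

while IFS=';' read -r key rest; do
  # compute a 2-char salt from the md5 of the original key
  salt=$(printf '%s' "$key" | md5sum | cut -c1-2)
  # emit salted_key;rest-of-line
  printf '%s_%s;%s\n' "$salt" "$key" "$rest"
done < input.csv > salted.csv

cat salted.csv
```

Note that queries then need to know the salting scheme to reconstruct row keys, which is why this only makes sense if you control both the write and read paths.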

I don't really know your specific use case, so I can't give you a more detailed answer, but I believe these suggestions will help you improve the performance of importing data into your table.

Sergey
  • Hey Sergey, I've already tried several numbers of regions and different split algorithms, but nothing has changed. It takes something like 12 seconds to start the MapReduce job, and 12-15 seconds to run the maps and the reduces – Eric C Aug 26 '19 at 09:32
  • Well, I guess in this case you'll have to accept the limitation of the overhead of spinning up a MapReduce job, because at this scale you won't be able to make much of a difference :) – Sergey Aug 27 '19 at 13:35