
Currently, an insert overwrite table T1 select * from T2; takes around 100 minutes on my cluster. Table T1 is ORC formatted and T2 is text formatted. I am reading 60 GB of text data from T2 and inserting into the ORC table T1 (10 GB after insertion). If I use text format for both tables, the insert takes around 50 minutes. In both cases, what can we do to improve write speed (I have large tables coming in), or any other suggestions?
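For reference, a minimal sketch of the scenario (table layouts and column names are illustrative, not from the original post):

```sql
-- Hypothetical source: ~60 GB of delimited text
CREATE TABLE T2 (id BIGINT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Hypothetical target: ORC, ~10 GB after compression
CREATE TABLE T1 (id BIGINT, payload STRING)
STORED AS ORC;

-- The statement being timed: ~100 min for text -> ORC on this cluster
INSERT OVERWRITE TABLE T1 SELECT * FROM T2;
```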

Despicable me

2 Answers


I recently worked out an approach that splits the source file into pieces before loading; it takes around 6 minutes to go from a text table to an ORC table in Hive for 100 GB of data. The approach is below.

Before inserting the file into text table

1. Split the file into small pieces at the unix location using the split command (a shell sketch follows step 2).

2. Then remove the original file from the path and keep only the split pieces.
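A minimal shell sketch of steps 1-2; the directory, file name, and chunk size are assumptions:

```bash
# GNU split's -C keeps whole lines together, so no record is cut in half;
# -d gives numeric suffixes. ~5 GB chunks turn a 100 GB file into ~20 pieces.
cd /data/staging
split -C 5G -d source.txt part_
rm source.txt            # step 2: keep only the split pieces
```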

Inserting into text table

3. Now load the data into the text table.

4. It will take a few minutes to load, and you will see the same number of files in the table as the pieces you split at the unix level (see the sketch below).
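Steps 3-4 then amount to a plain load; pointing LOAD DATA at the directory moves every split piece into the table's warehouse directory as its own file (path and table name are illustrative):

```sql
-- Each piece lands as one file, which is what later drives
-- one map task per piece on the ORC insert.
LOAD DATA LOCAL INPATH '/data/staging' INTO TABLE T2;
```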

Inserting into orc table

  5. For example: if you have split the actual file into, say, 20 pieces, then you will see 20 tasks/containers running on the cluster to load into the ORC table, which is much faster than the other solutions I came across (see the sketch below).
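The final insert is then the same statement as in the question; no special settings are assumed here, only the file layout produced above:

```sql
-- With ~20 input files behind T2, Hive schedules ~20 map
-- tasks/containers in parallel for this insert.
INSERT OVERWRITE TABLE T1 SELECT * FROM T2;
```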

@despicable-me

A.r. Naresh

That is probably normal behaviour: when you write data from text to text, Hive just copies data line by line from one file into another. Text-to-ORC does more work on top of that. Compared to the text-to-text operation, the text-to-ORC import performs additional bucketing/partitioning work and compression on your data. That is the reason for the time difference. The ORC format gives two main benefits over text format:

  • space savings due to compression (an example of choosing the codec follows this list)
  • improved access times when working with the data
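As an illustration of the compression trade-off, ORC's codec is chosen at table-creation time; the table layout below is hypothetical:

```sql
-- "orc.compress" accepts NONE, ZLIB (default), or SNAPPY;
-- SNAPPY trades some compression ratio for faster writes,
-- which matters if insert speed is the bottleneck.
CREATE TABLE T1 (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```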

Usually the INSERT operation is a one-time operation, while access operations will be very frequent. So it usually makes sense to spend some more time up front importing the data and then gain a huge benefit in space saved through optimized storage, and in optimized access times to that data.

Alex
  • Thanks Alex.. But I am looking into how to speed up insert overwrite, because we have strict SLAs on processing time and data of around 250 GB (text data)... So we want to optimize the existing process (the existing process updates the table with incremental data and removes duplicates: something like "insert into t1 select * from incremental table t2", then using a rank query we filter out duplicates and insert overwrite to T1). So are there any parameters or approaches that can be leveraged?? – Despicable me Aug 08 '16 at 10:35
  • @Despicableme Not sure I fully got the overall idea, but you can probably apply a four-step strategy for this. We used to follow this approach for incremental data updates with reconciliation of the duplicates. You still need a base ORC table holding all the records, but you will be able to optimize your incremental updates by decreasing the amount of data involved. See http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/ – Alex Aug 08 '16 at 11:16
  • Okay.. Thanks... To simplify: are there any parameters or approaches (like some kind of parallelism) we can apply so that insert overwrite will be faster (not query optimization, but raw data write speed)? ... Thanks again for the quick responses.. – Despicable me Aug 08 '16 at 17:55
  • @Despicableme You can try using dynamic partitioning on your INSERT OVERWRITE operation. I don't know if it will work for you; it also depends on your configuration and the number of nodes in your cluster. But given the right column for partitioning, Hive should automatically identify the number of partitions at run time, and your import job will be split into partition-based tasks across all the nodes of your cluster. Be careful, though, as too many partitions will decrease performance (a sketch follows these comments). https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions – Alex Aug 08 '16 at 19:13
  • Yes.. Sure.. will try that... Thank you – Despicable me Aug 08 '16 at 19:22
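A minimal sketch of the dynamic-partition variant Alex suggests, assuming the source table carries a low-cardinality column such as load_date (table and column names are illustrative):

```sql
-- Enable dynamic partitioning for this session
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE T1_part (id BIGINT, payload STRING)
PARTITIONED BY (load_date STRING)
STORED AS ORC;

-- Hive derives the partition value from the last SELECT column at
-- run time, fanning the write out across partition-based tasks.
INSERT OVERWRITE TABLE T1_part PARTITION (load_date)
SELECT id, payload, load_date FROM T2;
```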