I need to create an external table for some data stored on S3 and add partitions explicitly (unfortunately, the directory hierarchy does not fit the dynamic partition functionality due to a name mismatch), for example:

 add partition for region:euwest1, year:2018, month:01, day:18, hour:18     at:s3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=18/
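To make the example concrete, here is a minimal Python sketch that builds the corresponding `ALTER TABLE ... ADD PARTITION` statement. The table name `mydata`, the partition column names (`region`, `year`, `month`, `day`, `hour`), and the string-typed partition values are assumptions for illustration; only the S3 layout comes from the example above. Hive's DDL accepts several `PARTITION ... LOCATION ...` clauses in a single `ALTER TABLE ... ADD` statement, so the sketch emits one statement for a run of consecutive hours. Whether that is actually faster than one statement per partition has not been measured here.

```python
from datetime import datetime, timedelta

TABLE = "mydata"                # hypothetical table name
BASE = "s3://mybucket/mydata"   # bucket/prefix taken from the example above


def partition_clause(region, ts):
    """Build one PARTITION ... LOCATION ... clause for a region and hour."""
    # Assumed partition columns: region, year, month, day, hour (all strings).
    spec = (f"region='{region}', year='{ts:%Y}', month='{ts:%m}', "
            f"day='{ts:%d}', hour='{ts:%H}'")
    location = f"{BASE}/{region}/YYYY={ts:%Y}/MM={ts:%m}/dd={ts:%d}/HH={ts:%H}/"
    return f"PARTITION ({spec}) LOCATION '{location}'"


def add_partitions_statement(region, start, hours):
    """Build a single ALTER TABLE statement covering `hours` consecutive hours."""
    clauses = [partition_clause(region, start + timedelta(hours=h))
               for h in range(hours)]
    return f"ALTER TABLE {TABLE} ADD IF NOT EXISTS\n  " + "\n  ".join(clauses) + ";"


if __name__ == "__main__":
    # Two consecutive hourly partitions starting at 2018-01-18 18:00.
    print(add_partitions_statement("euwest1", datetime(2018, 1, 18, 18), 2))
```

The generated text could be saved to a file and run with `hive -f`, which would also avoid starting a new Hive session for every partition.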

I ran this on an EMR cluster with Hive 2.3.2 and instance type r4.2xlarge, which has 8 vCores and 61 GB of RAM. Adding one partition takes about 4 seconds. That is not too bad, but if we need to process multiple days of data, adding partitions will take a long time.

Is there any way to make this process faster? Thanks

seiya
  • The time ALTER TABLE ADD PARTITION takes could depend on the number of objects inside a partition. If you have a large number of objects/files on S3, then this could be expected. Also, which prefix are you using on EMR: s3:// or s3a://? You can enable DEBUG logging on the Hive client or HS2 and on the Hive Metastore to check the timeline. – jc mannem Jan 19 '18 at 23:08
  • @jcmannem thanks for the quick answer. Yes, each partition has a couple of hundred files, so as you said it could be part of the reason. As for the prefix, I'm using s3:// on EMR, which according to the EMR docs should be backed by EMRFS. Unfortunately s3a:// is not yet supported by EMR. – seiya Jan 22 '18 at 15:51
  • Yes, s3a:// is the open source implementation and is not really supported. I'm not sure whether there are any hive.metastore.* parameters, such as thread counts, that alter the performance of adding partitions. Since the file system is S3, there could also be some fs.s3.* parameters that we can tweak to speed up list operations etc. If you use HiveServer2, there might be some parameters to tweak there as well. I'd enable DEBUG logging to establish a timeline and then tweak the respective parameters to make it faster. We might also want to make sure the Hive client, the Metastore, or HS2 is not bottlenecked on memory, because EMR provisions them with 1 GB. – jc mannem Jan 23 '18 at 01:31

0 Answers