I need to create an external table for some data stored on S3 and add its partitions explicitly. Unfortunately, the directory hierarchy does not fit Hive's dynamic partition discovery because the directory names do not match the partition column names. For example:
add partition for region: euwest1, year: 2018, month: 01, day: 18, hour: 18 at: s3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=18/
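To make the example concrete, here is a minimal Python sketch that builds the kind of ALTER TABLE ... ADD PARTITION statement involved for one hour of data. The table name `mydata`, the partition column names (`region`, `year`, `month`, `day`, `hour`), the string-typed partition values, and the helper `add_partition_sql` are all assumptions for illustration; only the S3 path layout comes from the example above.

```python
def add_partition_sql(table, region, year, month, day, hour,
                      base="s3://mybucket/mydata"):
    """Build one ALTER TABLE ... ADD PARTITION statement (hypothetical helper)."""
    # Directory layout taken from the example path: YYYY=/MM=/dd=/HH=
    location = (f"{base}/{region}/YYYY={year:04d}/MM={month:02d}"
                f"/dd={day:02d}/HH={hour:02d}/")
    # Assumed partition columns, treated as strings; backticks quote the
    # identifiers because year/month/day/hour are also Hive keywords.
    spec = (f"`region`='{region}', `year`='{year:04d}', "
            f"`month`='{month:02d}', `day`='{day:02d}', `hour`='{hour:02d}'")
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}';")


if __name__ == "__main__":
    # One statement per hour; a full day needs 24 of these per region.
    print(add_partition_sql("mydata", "euwest1", 2018, 1, 18, 18))
```

Each statement like this is submitted to Hive separately, one per hourly partition.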
I ran this on an EMR cluster with Hive 2.3.2 and instance type r4.2xlarge, which has 8 vCores and 61 GB of RAM. Adding one partition takes about 4 seconds. That is not too bad on its own, but if we need to process multiple days of data, adding all the partitions would take a long time.
Is there any way to make this process faster? Thanks.