
I need to load S3 data into a Hive table. The S3 location is dynamic and is stored in another, static S3 location.

The dynamic S3 location that I want to load into the Hive table has the path format

s3://s3BucketName/some-path/yyyy-MM-dd

and the static location contains data in the format

{"datasetDate": "datePublished", "s3Location": "s3://s3BucketName/some-path/yyyy-MM-dd"}

Is there a way to read this data in Hive? I have searched a lot but could not find anything.

asked by hatim009

1 Answer


You can read the JSON data from your static location file, parse the s3Location field, and pass it as a parameter to your ADD PARTITION clause.

One possible way to read the JSON is with Hive itself; you can use some other means for the same purpose.

Example using Hive:

-- External table over the static location; each line of the file becomes one row
create external table data_location(location_info string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://s3BucketName/some-static-location-path/';
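You can check that the JSON parses as expected, for example:

select get_json_object(location_info, '$.s3Location') from data_location;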

Then get the location in a shell script and pass it as a parameter to the ADD PARTITION statement.

For example, suppose you have a table named target_table partitioned by datePublished.
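A minimal sketch of such a table, assuming a single hypothetical payload column and a hypothetical root path (replace both with your actual schema and location):

-- Hypothetical schema and root location, for illustration only
create external table target_table(payload string)
PARTITIONED BY (datePublished string)
STORED AS TEXTFILE
LOCATION 's3://s3BucketName/target-table-root/';

You can then add partitions from a shell script like this: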

#!/bin/bash

# Read the dynamic S3 location from the JSON in the static location table.
# \$ is escaped so bash passes the JSONPath '$.s3Location' to Hive literally.
data_location=$(hive -e "set hive.cli.print.header=false; select get_json_object(location_info,'\$.s3Location') from data_location")

# The partition value is the last path component (yyyy-MM-dd)
partition=$(basename "${data_location}")

# Create the partition in the target table, pointing at the dynamic location
hive -e "ALTER TABLE target_table ADD IF NOT EXISTS PARTITION (datePublished='${partition}') LOCATION '${data_location}'"

If you do not want a partitioned table, you can use ALTER TABLE ... SET LOCATION instead of adding a partition:

hive -e "ALTER TABLE TARGET_TABLE SET LOCATION='${data_location}'"

If only the last subfolder name is dynamic (the date) and the base directory is always the same, like s3://s3BucketName/some-path/ with only yyyy-MM-dd changing, you can create the table once with location s3://s3BucketName/some-path/ and issue a RECOVER PARTITIONS (or MSCK REPAIR TABLE) statement. In this case you do not need to read the content of the file with the location specification at all; just schedule RECOVER PARTITIONS to get new partitions attached on a daily basis. Note that partition discovery only picks up directories that follow the Hive key=value naming convention, i.e. datePublished=yyyy-MM-dd rather than a bare yyyy-MM-dd.
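A sketch of that variant, assuming the daily folders follow that key=value convention (payload is again a hypothetical column):

-- Table root covers all daily partition folders
create external table target_table(payload string)
PARTITIONED BY (datePublished string)
LOCATION 's3://s3BucketName/some-path/';

-- Attach any new partition directories found under the root
MSCK REPAIR TABLE target_table;
-- On Amazon EMR the equivalent is: ALTER TABLE target_table RECOVER PARTITIONS;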

answered by leftjoin
  • Is it possible to do this using Hive queries only, i.e. without involving a shell script? – hatim009 Oct 02 '19 at 07:15
  • @hatim009 Only the last option, maybe: if the main path is constant and new partitions are added as date folders, then you can run only RECOVER and do not need to parametrize it. And it does not necessarily have to be shell; it can be some workflow step calculating parameters for the next step. – leftjoin Oct 02 '19 at 07:18
  • Tried `MSCK [REPAIR] TABLE tablename;` but it only listed the new partitions without adding them. The bash script worked for me, though. Thanks @leftjoin. – hatim009 Oct 10 '19 at 11:55