
I'm using Hive (with external tables) to process data stored on Amazon S3.

My data is partitioned as follows:

                       DIR   s3://test.com/2014-03-01/
                       DIR   s3://test.com/2014-03-02/
                       DIR   s3://test.com/2014-03-03/
                       DIR   s3://test.com/2014-03-04/
                       DIR   s3://test.com/2014-03-05/

s3://test.com/2014-03-05/ip-foo-request-2014-03-05_04-20_00-49.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_06-26_19-56.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_15-20_12-53.log
s3://test.com/2014-03-05/ip-foo-request-2014-03-05_22-54_27-19.log

How can I create a partitioned table using Hive?

    CREATE EXTERNAL TABLE test (
        foo string,
        time string,
        bar string
    ) PARTITIONED BY (? string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LOCATION 's3://test.com/';

Could somebody answer this question? Thanks!

Brisi

5 Answers


First start with the right table definition. In your case I'll just use what you wrote:

CREATE EXTERNAL TABLE test (
    foo string,
    time string,
    bar string
)  PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';

Hive by default expects partitions to be in subdirectories named via the convention s3://test.com/partitionkey=partitionvalue. For example:

s3://test.com/dt=2014-03-05

If you follow this convention you can use MSCK REPAIR TABLE to add all partitions at once.
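Assuming the table definition above, a single statement then scans the table's location, discovers every convention-named directory, and registers the missing partitions in the metastore:

MSCK REPAIR TABLE test;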

If you can't or don't want to use this naming convention, you will need to add each partition manually, as in:

ALTER TABLE test
    ADD PARTITION (dt='2014-03-05')
    LOCATION 's3://test.com/2014-03-05';
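If you have many dated directories to register, Hive also accepts several PARTITION clauses in a single ADD statement; a sketch, using the dates from the question:

ALTER TABLE test ADD
    PARTITION (dt='2014-03-03') LOCATION 's3://test.com/2014-03-03'
    PARTITION (dt='2014-03-04') LOCATION 's3://test.com/2014-03-04'
    PARTITION (dt='2014-03-05') LOCATION 's3://test.com/2014-03-05';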
Carter Shanklin
  • The above query runs without any syntax error but doesn't load any data; the output is 0 bytes. Can you help me? – Brisi Mar 10 '14 at 13:16
  • 1
  • ALTER TABLE px_logs ADD PARTITION (ds=date_sub('${DAY}', 1)); When I try to use Python code to get the current date, I also get an error: FAILED: Parse Error: line 3:38 cannot recognize input 'date_sub' in constant – Brisi Mar 10 '14 at 13:42
  • @Brisi I got the same problem: nothing loaded, and the metastore shows me the location still in the dt= style, which is not what we set in ALTER. Do you have any idea now? – timesking Mar 31 '16 at 04:23
  • Use `MSCK REPAIR TABLE table_name;` after the files are in place in the mentioned partition format for the partitions to be auto-added in one go! – Sahil Sareen Jul 04 '17 at 17:09
  • @carter-shanklin, can you elaborate on the use of MSCK in this context? Also, I searched the web to no avail and cannot find any reference to the convention you are describing. Can you provide any references? – Hans Deragon Feb 06 '19 at 18:27

If you are going to partition using a date field, you need an S3 folder structure like the one below:

s3://test.com/date=2014-03-05/ip-foo-request-2014-03-05_04-20_00-49.log

In that case you can create the external table with date as the partition column and run MSCK REPAIR TABLE EXTERNAL_TABLE_NAME to update the Hive metastore.
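A minimal sketch of that flow, reusing the schema from the question (note that date is a reserved word in newer Hive versions, so quoting the partition column with backticks is the safe choice):

CREATE EXTERNAL TABLE test (
    foo string,
    time string,
    bar string
) PARTITIONED BY (`date` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';

-- picks up s3://test.com/date=2014-03-05/ and any sibling directories
MSCK REPAIR TABLE test;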

Venkatesh
  • CREATE EXTERNAL TABLE test ( foo string, time string, bar string ) PARTITIONED BY (date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://test.com/'; MSCK REPAIR TABLE test; I ran this script, but it doesn't load any data from S3, even though the S3 path above contains many log files. How do I use MSCK REPAIR TABLE? – Brisi Mar 06 '14 at 12:54
  • Can you post the listing of your S3 folder? – Venkatesh Mar 07 '14 at 05:52

If you have an existing directory structure that doesn't comply with <partition name>=<partition value>, you have to add the partitions manually. MSCK REPAIR TABLE won't work unless you structure your directories that way.

Once you have specified the location at table creation, like:

CREATE EXTERNAL TABLE test (
    foo string,
    time string,
    bar string
)  
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://test.com/';

you can add a partition without specifying the full path:

ALTER TABLE test ADD PARTITION (dt='2014-03-05') LOCATION '2014-03-05';
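Whichever way you add them, you can check what the metastore actually registered:

SHOW PARTITIONS test;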

Although I've never verified it, I suggest moving your partitions into a folder inside the bucket rather than keeping them directly in the bucket root, e.g. from s3://test.com/ to s3://test.com/data/.

cakraww

Please look at the answer posted above by Carter Shanklin. You need to make sure your files are stored in a directory structure of the form partitionkey=partitionvalue, i.e. the subdirectory naming convention Hive expects for partitions by default.

In your example it should be stored as

s3://test.com/date=20140305/ip-foo-request-2014-03-05_04-20_00-49.log.

Steps to be followed:

i) Make sure the data exists in the above structure
ii) Create the external table
iii) Now run MSCK REPAIR TABLE

doniv

I think the data is present in the S3 location but might not be updated in the metadata (EMRFS). For this to work, first do an emrfs import and emrfs sync, and then apply the MSCK repair.

It will add all the partitions that are present in S3.
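A sketch of that sequence, run on the EMR master node and assuming the bucket from the question; emrfs import and emrfs sync reconcile the EMRFS metadata with what is actually in S3 before Hive rescans it:

# refresh the EMRFS consistent-view metadata for the bucket
emrfs import s3://test.com/
emrfs sync s3://test.com/
# then let Hive rediscover the partitions
hive -e 'MSCK REPAIR TABLE test;'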

loneStar