
I have a set of log files and have created a Hive table for them. Now I want to partition the table based on a column. What I don't understand, and have not seen examples of, is how to specify the column/field to partition on. For example, here is a line from the log:

2012-04-11 16:49:10,629 ~ [http-7001-11] ~DE1F6F6667913022AE2620D1228817D6 ~ END ~ /admin/bp/setup/newedit/ok ~ pt ~ 219 ~

The table structure is:

CREATE TABLE log (starttime STRING, thread STRING, session STRING, method STRING, targeturl STRING, registry STRING, ipaddress STRING, details STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~';

Now, if I want the 6th column, 'registry', to be the partition for the logs, how would I write the partition statement? In general, if I have a line (row) with columns c1, c2, ..., c10, how do I specify column ci as the partition? Thanks

casperOne
Integration

2 Answers


What you have to do first is explicitly declare, when you create your table, which columns you expect to be partitions. Note that the partition column moves out of the regular column list and into PARTITIONED BY. To make registry your partition:

CREATE TABLE log(starttime STRING, thread STRING, session STRING, method STRING, targeturl STRING, ipaddress STRING, details STRING)
PARTITIONED BY (registry STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'

You can declare as many partition columns as you want. Each one becomes a nested sub-folder, and the order in which you declare them in PARTITIONED BY matters: the first is the top-level partition, the second is one level down, and so on. An example:

|--- mytable
      `---- mypartition1=x
                   `-------- mypartition2=x
                                     `------- ...
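To make the nesting concrete, here is a minimal sketch of a two-level partitioned table. The table name log_by_registry_day and the second partition column dt are hypothetical, chosen only for illustration; they are not part of the original question.

```sql
-- Hypothetical two-level variant of the log table: registry is the
-- top-level partition, dt (a date string) sits one level below it.
CREATE TABLE log_by_registry_day (
  starttime STRING,
  thread    STRING,
  session   STRING,
  method    STRING,
  targeturl STRING,
  ipaddress STRING,
  details   STRING
)
PARTITIONED BY (registry STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~';

-- Once data is loaded, directories would look like
--   .../log_by_registry_day/registry=pt/dt=2012-04-11/
-- and the registered partitions can be listed with:
SHOW PARTITIONS log_by_registry_day;
```

Swapping the order to PARTITIONED BY (dt STRING, registry STRING) would put the date directories on top instead, so the order is worth choosing based on how the data is usually filtered.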

What I would do is:

  1. Create an external table pointing to the location of your log directory, with no partitions at all. This exposes the raw log files as the table raw_log, containing all your data with no partitioning (yet):

    CREATE EXTERNAL TABLE raw_log(starttime STRING, thread STRING, session STRING, method STRING, targeturl STRING, registry STRING, ipaddress STRING, details STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
    LOCATION '/path/to/your/log/dir'
    
  2. Use Hive's dynamic partitions feature to read from raw_log and insert into log:

    FROM raw_log raw
    INSERT OVERWRITE TABLE log PARTITION(registry)
           -- the dynamic partition column (registry) must come last in the SELECT list
           SELECT raw.starttime, raw.thread, raw.session, raw.method, raw.targeturl, raw.ipaddress, raw.details, raw.registry
    

You can read more on dynamic partitions on the official Apache wiki
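A dynamic-partition insert usually also needs two session settings, and Hive takes the partition value from the trailing column(s) of the SELECT list, matched by position rather than by name. The following is a minimal sketch, assuming the log and raw_log tables defined above; whether the SET lines are required depends on your Hive version and its configured defaults.

```sql
-- Allow dynamic partitions, and allow every partition column to be
-- dynamic (strict mode requires at least one static partition value).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- registry is selected last so that Hive uses it as the partition value;
-- the other seven columns map, in order, to the regular columns of log.
FROM raw_log raw
INSERT OVERWRITE TABLE log PARTITION (registry)
SELECT raw.starttime, raw.thread, raw.session, raw.method,
       raw.targeturl, raw.ipaddress, raw.details,
       raw.registry;
```

With several partition columns, for example PARTITION (c3, c5), the SELECT list would end with c3 and then c5, in that same order.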

Charles Menguy
  • I like your solution and will do it that way. I understood it, thanks! But just for my understanding: if I don't create raw_log and I have a row of data with columns c1, c2, c3, ..., c10 (separated by, e.g., ~), how does one tell Hive to use column c3 or c5 as the partition column? – Integration Apr 23 '12 at 00:31
  • You do that in your dynamic partition query. In `INSERT OVERWRITE TABLE log PARTITION(registry)`, you are specifying that in your partitioned log table you would like registry to serve as the partition column. If you had multiple partition levels, you would do PARTITION(c3, c5) instead. – Mark Grover Apr 26 '12 at 03:57
  • @Integration Sorry for the delay, yes what Mark said is exactly right, you specify your partitions in the **PARTITION** statement. – Charles Menguy Apr 26 '12 at 15:55
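For completeness, the alternative to a dynamic-partition insert is a static partition, where you name the partition value yourself. A minimal sketch follows, assuming the partitioned log table above; the input path is hypothetical. As far as I know, LOAD DATA only moves files into the partition directory and does not parse a partition value out of each row, so this only works if the files are already split by registry and no longer contain the registry field.

```sql
-- Static partition: the value 'pt' is supplied by hand, not read from the rows.
-- '/path/to/pt/logs' is a hypothetical directory holding only 'pt' log lines,
-- with fields in the order of the seven regular columns of log.
LOAD DATA INPATH '/path/to/pt/logs'
INTO TABLE log PARTITION (registry = 'pt');

-- Queries that filter on the partition column only read that sub-folder.
SELECT starttime, targeturl
FROM log
WHERE registry = 'pt';
```

This is why the raw_log staging table is convenient: it lets Hive do the splitting by column value instead of requiring pre-split files.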

In addition, Hive has had an issue with external tables that are partitioned. Say you declare your table as external after you have set up your data in subdirectories of the form dt=2012-04-01, for instance, along with other similar directories. Everything appears fine, except that your data will not show up in selects, because the partitions have not been registered in the metastore.

Amazon Elastic MapReduce added a feature that solves this problem:

ALTER TABLE log RECOVER PARTITIONS;

I am adding this info since it took me a while to find out why my data was not there.
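Outside Elastic MapReduce, stock Apache Hive offers two ways to register existing partition directories. This is a minimal sketch: the table name ext_log and the path are hypothetical, and MSCK REPAIR TABLE availability and behavior may vary with your Hive version.

```sql
-- Option 1: register one partition explicitly, pointing at its directory.
ALTER TABLE ext_log ADD PARTITION (dt = '2012-04-01')
LOCATION '/path/to/your/log/dir/dt=2012-04-01';

-- Option 2: ask Hive to scan the table location and add any
-- partition directories that are missing from the metastore.
MSCK REPAIR TABLE ext_log;

-- Check what the metastore now knows about.
SHOW PARTITIONS ext_log;
```

Option 1 is explicit and works for directories that do not follow the name=value layout; option 2 is less typing when many dt=... directories already exist.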

Jaime Garza