0

I have a bunch of gzipped files in HDFS under directories of the form /home/myuser/salesdata/some_date/ALL/<country>.gz , for instance /home/myuser/salesdata/20180925/ALL/us.gz

The data is of the form

<country> \t count1,count2,count3

So essentially it's first tab separated and then I need to extract the comma separated values into separate columns

I'd like to create an external table, partitioning this by country, year, month and day. The size of the data is pretty huge, potentially 100s of TB and so I'd like to have an external table itself, rather than having to duplicate the data by importing it into a standard table.

Is it possible to achieve this by using only an external table?

Craig
  • 1,929
  • 5
  • 30
  • 51

1 Answers1

1

considering your country is separated by tab '\t' and other fields separated by , this is what you can do.

You can create a temporary table which has first columns as string and rest as array.

create external table temp.test_csv (country string, count array<int>)
row format delimited
fields terminated by "\t"
collection items terminated by ',' 
stored as textfile
location '/apps/temp/table';

Now if you drop your files into the /apps/temp/table location you should be able to to select the data as mentioned below.

select country, count[0] as count_1, count[1] count_2, count[2] count_3 from temp.test_csv

Now to create partitions create another table, as mentioned below.

drop table temp.test_csv_orc;
create table temp.test_csv_orc ( count_1 int, count_2 int, count_3 int) 
partitioned by(year string, month string, day string, country string) 
stored as orc; 

And load the data from temporary table into this one.

insert into temp.test_csv_orc partition(year="2018", month="09", day="28", country) 
select count[0] as count_1, count[1] count_2, count[2] count_3, country from temp.test_csv

I have taken country as Dynamic Parition as it's coming from file however others aren't so it's static.

Gaurang Shah
  • 11,764
  • 9
  • 74
  • 137
  • Thanks - but the problem is that I _do not_ want to create a new table due to resource constraints. I need to restrict myself to using external tables alone – Craig Sep 26 '18 at 00:21
  • in that case you want be able to make partition on `year, month and day.` as it's not part of your data. You could always load the file in table using `insert` statement to achieve this. – Gaurang Shah Sep 26 '18 at 00:51