0

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a hive metastore so I can query the data with Presto / Hive, etc. The data is poorlly structured (no consistent delimiter, ugly characters, etc). Here's a scrubbed sample:

1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009 

I was able to create a table pointing to one of the subdirectories using a serde regex and the fields are parsing properly, but as far as I can tell I can only load one subfolder at a time.

How does one add more data to an existing hive metastore?

Here's an example of my hive metastore create statement with the regex serde bit:

DROP TABLE IF EXISTS test;

CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.' 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
  WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");

select * from test limit 10;

I realize there is probably a very simple solution, but I tried INSERT INTO in place of CREATE EXTERNAL TABLE, but it understandably complains about the input, and I looked in both the hive and serde documentation for help but was unable to find a reference to adding to an existing store.

dtolnay
  • 9,621
  • 5
  • 41
  • 62
TheProletariat
  • 916
  • 2
  • 11
  • 23
  • 1
    Wow. (1) You are not "loading" anything. An external table is an interface containning instructions for data reading and possibly writing. (2) The metastore holds the definition of the table, not the data. (3) I seriously doubt you are querying ORC files using RegexSerDe. Also there is no indication for ORC in the external table definition. (4) "output.format.string" is obsolete – David דודו Markovitz Apr 17 '17 at 15:24
  • There 2 possible solutions. Add the structure of the folders so we'll see which is relevant. P.s. Add a data sample - There is a chance that RegexSerDe is an overkill here. – David דודו Markovitz Apr 17 '17 at 15:27
  • Thanks, @DuduMarkovitz. 1) Poor choice of words; I should know better. Edited. 2) see (1). 3) the raw files are .orc, but I think I'm missing a `stored as orc` line. Fixed now. 4) I didn't know that -- thanks. I'll add a data sample. There's no delimiter, so regex is necessary -- ugly data. You can probably tell, but I'm new to the Hadoop ecosystem, so I appreciate the help. – TheProletariat Apr 17 '17 at 15:47

2 Answers2

0

Possible solution using partitions.

CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string) 
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
  WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");

alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
.
.
.
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88
0

For @TheProletariat (the OP)

It seems there is no need for RegexSerDe since the columns are delimited by space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true")

create external table test 
(
    field1 bigint
   ,field2 string
   ,field3 string
   ,field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true")
;
David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88
  • They're not actually delimited by space. They are DNS records which can take many different formats, some of which have multiple spaces, and some of which have none, like this: TXT IN: NXDOMAIN/0/148 TXT IN: NOERROR/1/124 10090_10203\009sphericalcow.com\009 etc. – TheProletariat Apr 17 '17 at 20:24