2

I am trying to create a hive table with partition by a single field. The data that i wanted to process is log data. Format of log is:

DATE TIME IPAddress HTTP_METHOD MESSAGE

Create table hive query:

CREATE EXTERNAL TABLE test_Part(
logdate string,
logtime string,
ip string,
message string)
PARTITIONED BY(method string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d{4}-\\d{2}-\\d{2})\\s(\\d{2}:\\d{2}:\\d{2})\\s(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s(\\S+)\\s(.*$)",
"output.format.string" = "%1$s %2$s %3$s %5$s %4$s"
)
STORED AS TEXTFILE;

And load script:

LOAD DATA LOCAL INPATH '/home/user/tools/apache-hive-1.2.2-bin/scripts/sample1.log' OVERWRITE INTO TABLE test_Part PARTITION(method='GET');

When i run a select query on the above table, it gives me error message as

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns

What am i missing ?

dtolnay
  • 9,621
  • 5
  • 41
  • 62
Sampath
  • 45
  • 2
  • 6

2 Answers2

1

Partitions reflect a condition where rows characterized by specific value(s) (partition column(s) values) are accessible directly without the need access additional unnecessary data.
This is clearly not the case here, so you can't declare a partitioned table over your data file(s).

From this point you can do the following:

1.

leave out the partitions.

CREATE EXTERNAL TABLE test_Part_stg(
logdate string,
logtime string,
ip string,
method string,
message string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d{4}-\\d{2}-\\d{2})\\s(\\d{2}:\\d{2}:\\d{2})\\s(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s(\\S+)\\s(.*$)"
)
STORED AS TEXTFILE;

2. (optional)

Create an additional partitioned table and fill it using the table from the previous step.

(This table storage does not have to be TEXTFILE)

CREATE EXTERNAL TABLE test_Part(
logdate string,
logtime string,
ip string,
message string)
PARTITIONED BY(method string)
STORED AS TEXTFILE;

set hive.exec.dynamic.partition.mode=nonstrict
;

insert into test_Part partition (method)
select logdate,logtime,ip,message,method 
from   test_Part_stg
;

or

insert into test_Part partition (method) (logdate,logtime,ip,method,message)
select * 
from   test_Part_stg
;

P.s.

output.format.string was deprecated, it is not defined in the RegexSerDe and it has no functional meaning as any other undefined SerDe parameter.

David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88
0

Should be 4 groups because there are 4 columns in the table DDL. In your regex there are 5 groups. method is a partition (directory with files), this column usually does not exist in the file itself and you do not need to specify it in regex.

Partition is stored as metadata containing partition location and key value. Location looks like table_dir/method=GET/

See good example here: http://www.dowdandassociates.com/blog/content/howto-use-hive-with-apache-logs/

If partition column also exists in the file, you should add more column to the table definition. It seems column exists. Is it HTTP_METHOD? then simply add one more column HTTP_METHOD before message column and check your regexp once more.

leftjoin
  • 36,950
  • 8
  • 57
  • 116