2

I have a file that looks like this:

33.49.147.163           20140416123526  https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en    29  409 Firefox/5.0

I want to load it into a hive table. I do it this way:

create external table Logs (
    ip string,
    ts timestamp,
    request string,
    page_size smallint,
    status_code smallint,
    info string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
"timestamp.formats" = "yyyyMMddHHmmss",
"input.regex" = '^(\\S*)\\t{3}(\\d{14})\\t(\\S*)\\t(\\S*)\\t(\\S*)\\t(\\S*).*$'
)
stored as textfile
location '/data/user_logs/user_logs_M';

And

select * from Logs limit 10;

results in

33.49.147.16 NULL https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en 29 409 Firefox/5.0

How to parse timestamps correctly, to avoid this NULLs?

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Maxim
  • 47
  • 1
  • 6
  • This format is not supported, found it [here](https://stackoverflow.com/questions/44705050/hive-timestamp-format) – Maxim Dec 11 '21 at 10:36
  • Not sure if its possible. You can load into normal table and then convert it to a timestamp – Koushik Roy Dec 11 '21 at 14:45

1 Answers1

2

"timestamp.formats" SerDe property works only with LazySimpleSerDe (STORED AS TEXTFILE), it does not work with RegexSerDe. If you are using RegexSerDe, then parse timestamp in a query.

Define ts column as STRING data type in CREATE TABLE and in the query transform it like this:

select timestamp(regexp_replace(ts,'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})','$1-$2-$3 $4:$5:$6.0')) as ts

Of course, you can extract each part of the timestamp using SerDe as separate columns and properly concatenate them with delimiters in the query to get correct timestamp format, but it will not give you any improvement because anyway you will need additional transformation in the query.

leftjoin
  • 36,950
  • 8
  • 57
  • 116