Getting extra NULL rows when loading data into a Hive table with a regex SerDe

Question

I have the following 4 lines of data in a file on HDFS, and I want to load them into a table. I have a regex that parses them, but the table ends up with an extra row of NULLs for each line of data. Does anyone know why this is happening?

19/Mar/2018 3:00:06 INFO activity Submitted to Splunk
19/Mar/2018 3:00:20 INFO activity response received statuscode=200 bytesreceived=11548264
19/Mar/2018 3:00:21 INFO activity done writing K:\Data\031818\activity_031818.csv lineswritten=296110
19/Mar/2018 3:00:21 INFO hardware Submitted to Splunk

I use this to create the table:

create table Splunk_BCO_MSR 
(
ts string, 
status string, 
area string, 
text string
) 
partitioned by (partition_dt date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ("input.regex" = "([^ ]+[ ][^ ]*) ([^ ]*) ([^ ]*) (.*)?");

This almost works. However, when I run a select * on the table, I get 8 rows instead of 4: each data row is followed by an additional row in which every column except the partition column is NULL.

| 19/Mar/2018 3:00:06 | INFO | activity | Submitted to Splunk                                                 | 2018-03-18 |
| NULL                | NULL | NULL     | NULL                                                                | 2018-03-18 |
| 19/Mar/2018 3:00:20 | INFO | activity | response received statuscode=200 bytesreceived=11548264             | 2018-03-18 |
| NULL                | NULL | NULL     | NULL                                                                | 2018-03-18 |
| 19/Mar/2018 3:00:21 | INFO | activity | done writing K:\Data\031818\activity_031818.csv lineswritten=296110 | 2018-03-18 |
| NULL                | NULL | NULL     | NULL                                                                | 2018-03-18 |
| 19/Mar/2018 3:00:21 | INFO | hardware | Submitted to Splunk                                                 | 2018-03-18 |
| NULL                | NULL | NULL     | NULL                                                                | 2018-03-18 |
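
One way to narrow this down is to check which raw lines the pattern actually accepts. The sketch below is only an illustration in Python, not Hive itself: it assumes that Python's `re.fullmatch` behaves like the whole-line match RegexSerDe performs for this particular pattern, and the blank line in `samples` is a hypothetical input, not something confirmed to be in the file. The names `parse_line` and `samples` are made up for the example.

```python
import re

# Same pattern as "input.regex" in the CREATE TABLE statement.
INPUT_REGEX = r"([^ ]+[ ][^ ]*) ([^ ]*) ([^ ]*) (.*)?"


def parse_line(line):
    """Return the four captured fields, or None if the whole line does not match."""
    match = re.fullmatch(INPUT_REGEX, line)
    return match.groups() if match else None


samples = [
    "19/Mar/2018 3:00:06 INFO activity Submitted to Splunk",
    "",  # hypothetical blank line between records (an assumption, not verified)
    r"19/Mar/2018 3:00:21 INFO activity done writing K:\Data\031818\activity_031818.csv lineswritten=296110",
]

for line in samples:
    # Lines that print None are the ones the pattern rejects.
    print(repr(line), "->", parse_line(line))
```

If the file does contain blank lines, or any other lines the pattern rejects, each of them would plausibly show up as a row of NULLs, since RegexSerDe is documented to return NULL for every column of a row that does not match the regex.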

You might add a delimiter to your table definition, such as `ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '`, where the delimiter is a space - HISI, Apr 04 '18 at 18:25
I think you should add a delimiter to each row before loading the data into Hive - HISI, Apr 04 '18 at 18:27
In my script, ROW FORMAT DELIMITED is replaced by ROW FORMAT SERDE. That lets me apply the right regex to the data... at least until I get the duplicate rows - Micah Pearce, Apr 05 '18 at 20:17
