0

I have a similiar question to here: Hive table source delimited by multiple spaces

My data looks like this:

AL, 01, 2016010700,   , BEST,   0, 266N,  753W
AL, 01, 2016010706,   , BEST,   0, 276N,  747W
AL, 01, 2016010712,   , BEST,   0, 287N,  738W
AL, 01, 2016010712,   , BEST,   0, 287N,  738W

That means my column delimiter is "a comma plus a variable number of spaces".

I tried to simply modify field.delim by adding this comma to the regex, but it doesn't work. The result is, that all data gets put into the first column (basin) and all other columns are NULL.

CREATE EXTERNAL TABLE IF NOT EXISTS default.myTable1
(
   basin string
  ,cy string
  ,yyyymmddhh int
  ,technum_min string
  ,tech string
  ,tau string
  ,lat_n_s string
  ,lon_e_w string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' 
WITH SERDEPROPERTIES ("field.delim"=",\\s+")
LOCATION '/data';

I am running HDP 2.5 (Hive 1.2.1).

Thanks for any help and suggestions.

dtolnay
  • 9,621
  • 5
  • 41
  • 62
BlueElephant
  • 1
  • 1
  • 2

2 Answers2

2

We have two approach to solve your problem. create table 'rawTbl' using below option

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

and use trim() to remove space

Insert into baseTbl select trim(basin), trim(cy),...., from rawTbl

OR you can use regEx

I have updated answer with regex which separate text input file composed of requested fields. Regex contains 7 regex groups capturing the requested field on each line.

CREATE EXTERNAL TABlE tableex(basin string
  ,cy string
  ,yyyymmddhh int
  ,technum_min string
  ,tech string
  ,tau string
  ,lat_n_s string
  ,lon_e_w string ) 
ROW FORMAT 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '^([A-Za-z]{2}),\s+(\d{2}),\s(\d{10}),\s+,\s([A-Z]{4}),\s+(\d{1}),\s+(\d{3}[A-Z]{1}),\s+(\d+[A-Z]{1})'
)
LOCATION '/data';
  • `input.regex` defines the columns' expressions, not the separator. – David דודו Markovitz Jul 27 '17 at 10:22
  • Yes, that's why I wanted to try `MultiDelimitSerDe` as the linked question indicated. Unfortunately, using the `RegexSerDe` doesn't work: `ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = ",\s+")` I also tried it with double back-slashes: ",\\s+". It is always giving me all columns with NULL values. – BlueElephant Jul 27 '17 at 10:31
  • Hi :-) Better, but I would have used something like `\s*(\S*?),` instead of fixed lengh – David דודו Markovitz Jul 27 '17 at 11:59
  • I have one workaround : (1) create table with ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' (2) Insert into baseTbl select trim(basin), trim(cy),...., from rawTbl – Manish Saraf Bhardwaj Jul 27 '17 at 12:07
  • I upvoted this, but only for the first part of the solution. I would be quite sad if I received the second one and had to update or troubleshoot it. – Dennis Jaheruddin Jul 27 '17 at 13:35
  • @DuduMarkovitz : What is your feedback? – Manish Saraf Bhardwaj Jul 27 '17 at 13:46
  • I already commented ("Better, but..."), am I missing something? – David דודו Markovitz Jul 27 '17 at 14:07
  • @BlueElephant : please make the answer as correct if it works for you. – Manish Saraf Bhardwaj Jul 28 '17 at 04:31
  • @ManishSarafBhardwaj: Thanks for your answer. The INSERT INTO approach works file. Altnerativley, we could simply put a view on top of it with the trim statements. However, the second solution with the regex doesn't work for me - In my Hive version I need to write `ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'` => the keyword `SERDE` is needed in my Hive version. But all in all, I always get all columns with NULL values in it. Which Hive version are you using? – BlueElephant Jul 28 '17 at 06:44
  • @DuduMarkovitz : I tried something similiar with `"field.delim"=",\\s+"` (and some other variants) ... your regex looks interesting to me... but I am not sure how to interpret it... can you eloborate a bit on it? Of course, having an elegant regex to do the parsing would be perfect. – BlueElephant Jul 28 '17 at 06:48
  • 1) `field.delim` is a single character, not regex 2) the regex means a sequence of zero or more spaces and than a sequence of zero or more non spaces – David דודו Markovitz Jul 28 '17 at 13:39
0

how about this

(\S+),\s+(\S+),\s(\S+),\s+,\s(\S+)\s+(\S+),\s+(\S+),\s+(\S*)
user7343922
  • 316
  • 4
  • 17