2

I have a file:

id,name,address
001,adam,1-A102,mont vert
002,michael,57-D,costa rica

I have to create a hive table which will contain three columns : id, name and address using comma delimited but here the address column itself contains comma in between. How are we going to handle this.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Sanskar Suman
  • 63
  • 1
  • 12

1 Answers1

3

One possible solution is using RegexSerDe:

CREATE TABLE table my_table (
    id       string,
    name     string,
    address  string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(.*?),(.*?),(.*?)$') 
location 'put location here' 
;

Replace location property with your table location and put the file(s) into that location.

First group (.*?) will match everything before first comma, second group will match everything after first comma and before second comma and third group will match everything after second comma.

Also add TBLPROPERTIES("skip.header.line.count"="1") if you need to skip header and it always exists in the file. If header can be absent, then you can filter header rows using where id !='id'

Also you can easily test Regex for extracting columns even without creating table, like this:

select regexp_replace('002,michael,57-D,costa rica','^(.*?),(.*?),(.*?)$','$1|$2|$3');

Result:

002|michael|57-D,costa rica

In this example query returns three groups, separated by |. In such way you can easily test your regular expression, check if groups are defined correctly before creating the table with it.

Answering question in the comment. You can have address with comma and one more column without comma like this:

select regexp_replace('001,adam,1-A102, mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');

Returns:

001|adam|1-A102, mont vert|sydney

Checking comma is optional in Address column:

hive> select regexp_replace('001,adam,1-A102 mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');

Returns:

001|adam|1-A102 mont vert|sydney

Read this article for better understanding: https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562

[^,] means not a comma, last column can be everything except comma.

And of course add one more column to the DDL.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • id,name,address,city 001,adam,1-A102,mont vert,sydney 002,michael,57-D,costa rica,melbourne suppose I added one more column here, So how are we going to implement serde here. – Sanskar Suman Sep 04 '19 at 09:42
  • @SanskarSuman And how can you define rule to determine is it is address containing comma or address, (comma) city? I'm asking this because you said that address can contain comma. If you are able to define the rule, then it should be possible to implement regex. – leftjoin Sep 04 '19 at 10:10
  • This is what I am asking here. If I have an additional column city at the end, so how I can make sure that this data is going to be loaded in my last column in hive table eg :- output - 002|michael|57-D,costa rica | Sydney input- 002,michael,57-D,costa rica,Sydney lets consider that address column doesn't contain any city name and we have a separate city column for this – Sanskar Suman Sep 04 '19 at 10:45