How to handle delimiter in column value?

Question

I am trying to load CSV file data into my Hive table,but but it has delimiter(,) , in one column's value, so Hive is taking it as a delimiter and loading it into a new column. I tried using escape sequence \ but in that I also \ (it its not working and always loading data in new column after , .

My CSV file.:

        id,name,desc,per1,roll,age
        226,a1,"\"double bars","item1 and item2\"",0.0,10,25
        227,a2,"\"doubles","item2 & item3 item4\"",0.1,20,35
        228,a3,"\"double","item3 & item4 item5\"",0.2,30,45
        229,a4,"\"double","item5 & item6 item7\"",0.3,40,55

I have updated my table.:

    create table testing(id int, name string, desc string, uqc double, roll int, age int) 
    ROW   FORMAT SERDE 
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
     WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = '"',
    "escapeChar" = "\\" ) STORED AS textfile;

But still I'm getting data in a different column after ,.

I'm using load data in path command.

But it is still a mess with quotes and escaping. Why only those two quotes are escaped and not all inside the string? Try using RegexSerDe instead of textFile, you can write regex for correct parsing of such records — leftjoin, Jan 07 '21 at 08:42

score 1 · Accepted Answer · answered Jan 07 '21 at 17:15

This is how to create table based on RegexSerDe.

Each column should have corresponding capturing group () in the regex. You can easily debug regex without creating the table using regex_replace:

select regexp_replace('226,a1,"\"double bars","item1 and item2\"",0.0,10,25',
                      '^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*', --6 groups
                     '$1 $2 $3 $4 $5 $6'); --space delimited fields

Result:

226 a1 "double bars","item1 and item2" 0.0 10 25

If it seems good, create table:

 create external table testing(id int, 
                      name string, 
                      desc string, 
                      uqc double, 
                      roll int, 
                      age int
                     ) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('input.regex'='^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')
location ....
TBLPROPERTIES("skip.header.line.count"="1")
;

Read this article for more details.

How to handle delimiter in column value?

1 Answers1

Linked