1

Can someone please guide me how should I Load data in hive where I am getting " in some rows and in some rows data is coming without " for the same column value.

    Sample Data:

    id,name,desc,uqc,roll,age
    1,Monali,"abhc,jkjk",,23,23
    2,mj,nhiijkla,67,23,60
    7,jena,"kdjuu,hsysi,juw",3,34,23
    1,Monali,"/"coppers bars","rods and profiles"/",,23,23
    2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60

In above data for id '2' " are not there in desc column value.

My Create Statement:

    create external table testing(id int, 
                  name string, 
                  desc string, 
                  uqc double, 
                  roll int, 
                  age int
                 ) 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES ('input.regex'='^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')
    location ....
    TBLPROPERTIES("skip.header.line.count"="1")
    ;

While loading Data I'm not getting any error. But when I doing select * from testing.select statement is not Executing. Above Create and select statement working fine If data is coming with ", but is not working if data is coming with and without ".

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Rahul Patidar
  • 189
  • 1
  • 1
  • 14
  • can you pls use openCSVSerde? `CREATE EXTERNAL TABLE tab ( desc STRING, id BIGINT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "\"") LOCATION '/your/dir/location/';` – Koushik Roy Jan 22 '21 at 17:30
  • @KoushikRoy I tried but in case of this row:--> 2,money,"17",19"LCD PANEL FOR COMPUTER",67,23,60 its not working. "17",19"LCD PANEL FOR COMPUTER" is the value of my desc column value but its loading 17 in desc column and 19"LCD PANEL FOR COMPUTER loading in next column. – Rahul Patidar Jan 22 '21 at 20:30
  • dont you think this is inconsistent ? can you get all columns enclosed with double quote? This will ensure good data quality. If its not possible then i am not sure how can we load it. – Koushik Roy Jan 22 '21 at 20:50

2 Answers2

1

Currently third group in the regexp is enclosed in quotes (quotes are mandatory). Try to make quotes optional "? - means zero or one quote, also make group content non-greedy (.*?), so it will not catch extra quote inside the group:

'input.regex'='^(\\d+?),(.*?),"?(.*?)"?,([0-9.]*),(\\d*),(\\d*).*' 

Testing your data example using regexp_replace, I also added optional slash around 3rd group to remove it from the output:

with mytable as (
select stack(6,
    '1,Monali,"abhc,jkjk",,23,23',
    '2,mj,nhiijkla,67,23,60',
    '7,jena,"kdjuu,hsysi,juw",3,34,23',
    '1,Monali,"/"coppers bars","rods and profiles"/",,23,23',
    '2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60',
    '2,money,"17",19"LCD PANEL FOR COMPUTER",67,23,60'
) as initial_data
)

select regexp_replace(initial_data,'^(\\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\\d*),(\\d*).*',
                                   '$1 || $2 || $3 || $4 || $5 || $6'
                     ) as parsed_result
 from mytable

Result (delimited by two pipes and spaces ' || '):

parsed_result
1 || Monali || abhc,jkjk || || 23 || 23
2 || mj || nhiijkla || 67 || 23 || 60
7 || jena || kdjuu,hsysi,juw || 3 || 34 || 23
1 || Monali || "coppers bars","rods and profiles" || || 23 || 23
2 || money || "COUPLING","FLANGES & CROSS OVER" || 67 || 23 || 60
2 || money || 17",19"LCD PANEL FOR COMPUTER || 67 || 23 || 60

So, if result looks good, use this regexp in the table DDL:

'input.regex'='^(\\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\\d*),(\\d*).*'

Carefully test it on the whole dataset and check for empty/null values, fix regexp if necessary.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • It's worked For me and i accepted the answer. However when I am applying the same logic in case of multiple Column i.e. I have 68 Columns in my table.so after "some value , its going in next column. i.e. "College,scince and Business" so College is coming in desc column but scince and Business are coming in next column Can u Please guide Me how should I extend the same logic for different column count. I have 68 column in my table and want to implement the same logic for column number 51.I am adding my regex Expression in next Comment. Kindly Help.@leftjoin – Rahul Patidar Jan 24 '21 at 13:33
  • 'input.regex'='(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),"?/?(.*?)/?"?,(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?)') – Rahul Patidar Jan 24 '21 at 13:34
  • @Varun Read the article: https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562 Start with one column and also use string beginning anchor ^ like this : `'^(.*?),.*'` add columns one by one and use regexp_replace to debug it. – leftjoin Jan 24 '21 at 18:11
  • @Varun In some cases when column can contain comma(delimiter) and can be not quoted and next column also can contain comma(delimiter) and also can be not quoted, then it is not possible to formalize a rule how these columns should be extracted. Columns containing digits only or always quoted columns + end of the string anchor - $ can help in this case. But you should understand the rule how to extract. If you understand, then it is possible – leftjoin Jan 24 '21 at 18:21
0

Try with this tag:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
diogoramos
  • 86
  • 7
  • I tried but in case of this row -----> 2,money,"17",19"LCD PANEL FOR COMPUTER",67,23,60 its not working. "17",19"LCD PANEL FOR COMPUTER" is the value of my desc column value but its loading 17 in desc column and 19"LCD PANEL FOR COMPUTER loading in next column.@diogoramos – Rahul Patidar Jan 22 '21 at 20:40