1

We are having a file, which is of the following type:

1- Sam, Joshua , "52 DD dr,
   Lake Hiawatha" , New Jersey, 07034 
2- Ruchi,kumari,SNN Raj serenity,Bengaluru, 560068

The line 1 is split into 2 rows in the External table with the rest of the columns being null in 1st row and 2nd row is having rest of the data.

Need assistance on what is the best way to load in a single column to overcome this issue. Went through a couple of solutions in the web , but was not clear.

Tried the following options:

1) Used the Regex Serde

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)

but it did not work

2) CSVInputFormat from github https://github.com/mvallebr/CSVInputFormat

But not able to use it.

Sam Berchmans
  • 127
  • 13

1 Answers1

0

I tried the following option and it worked for me ,

1) Regex tester - for this new line scenario the regex is very complicated , and it is not working.

2) Use CVS parser provided by https://github.com/mvallebr/CSVInputFormat and also had a chat with him on how to use it. Tried multiple options but not working.

3) The quick simple fix is to try the legacy method to replace the new lines in the File using shell or Perl command and it worked smoothly. Seems that to be a more feasible and easy option.

Sam Berchmans
  • 127
  • 13