
I am new to Azure Data Lake Analytics. I am trying to load a CSV in which string values are double-quoted, and on some random rows there are quotes inside a column.

For example:

ID, BookName
1, "Life of Pi"
2, "Story about "Mr X""

When I try loading, it fails on the second record and throws an error message.

1. Is there a way to fix this in the CSV file? Unfortunately we cannot extract a new file from the source, as these are log files.
2. Is it possible to let ADLA ignore the bad rows and proceed with the rest of the records?

Execution failed with error '1_SV1_Extract Error':

{
  "diagnosticCode": 195887146,
  "severity": "Error",
  "component": "RUNTIME",
  "source": "User",
  "errorId": "E_RUNTIME_USER_EXTRACT_ROW_ERROR",
  "message": "Error occurred while extracting row after processing 9045 record(s) in the vertex' input split. Column index: 9, column name: 'instancename'.",
  "description": "",
  "resolution": "",
  "helpLink": "",
  "details": "",
  "internalDiagnostics": "",
  "innerError": {
    "diagnosticCode": 195887144,
    "severity": "Error",
    "component": "RUNTIME",
    "source": "User",
    "errorId": "E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD",
    "message": "Invalid character following the ending quote character in a quoted field.",
    "description": "Invalid character is detected following the ending quote character in a quoted field. A column delimiter, row delimiter or EOF is expected.\nThis error can occur if double-quotes within the field are not correctly escaped as two double-quotes.",
    "resolution": "Column should be fully surrounded with double-quotes and double-quotes within the field escaped as two double-quotes."
  }
}


1 Answer


As per the error message, if you are importing a quoted CSV which has quotes within some of the columns, then these need to be escaped as two double-quotes. In your data, the affected row needs to be:

..."Life after death and ""good death"" models - a qualitative study",...

So one option is to fix up the original file at the source. If you are not able to do this, then you can import all the columns as one column, use a regular expression to fix up the quotes, and output the file again, e.g.:

// Import each record as one column, using a delimiter that does not
// occur in the data, then use RegEx below to clean the quotes
@input =
    EXTRACT oneCol string
    FROM "/input/input132.csv"
    USING Extractors.Text('|', quoting: false);


// Fix up the quotes using RegEx: double up any quote that has a
// non-comma character on both sides, i.e. an embedded quote rather
// than a quote at a field boundary
@output =
    SELECT Regex.Replace(oneCol, "([^,])\"([^,])", "$1\"\"$2") AS cleanCol
    FROM @input;

OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv(quoting: false);
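
Once the quotes are escaped, the cleaned file can be read back with the standard Csv extractor. A minimal sketch, assuming the column names from the sample header in your comments and reading every field as string for simplicity (skipFirstNRows : 1 skips the header row):

// Re-import the cleaned file; embedded quotes are now escaped as ""
@clean =
    EXTRACT id string, datetime string, userid string, idnumber string,
            ip string, course string, shortname string, module string,
            cmid string, instancename string, action string, url string,
            info string
    FROM "/output/output.csv"
    USING Extractors.Csv(quoting : true, skipFirstNRows : 1);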

The file will now import successfully.

wBob
  • Hi wBob, thanks for your reply. I am not sure where I went wrong with the above example. Can you please try the example below? Thanks – Anbarasan Dhanushkodi Jul 09 '18 at 20:57
  • "id","datetime","userid","idnumber","ip","course","shortname","module","cmid","instancename","action","url","info" "39804","03JUL13:04:36:15","5169","Y429","111.111.117.169","201467","284-13B","forumng","222332","TMA03 forum","view discussion","discuss.php?d=6041","d6041" "39807","03JUL13:04:36:16","175"," ","111.108.111.32","202248","K2-13J","url","351408","Life after death and "good death" models - a qualitative study","view","view.php?id=3518","479" – Anbarasan Dhanushkodi Jul 09 '18 at 21:00
  • By the way, I have used the following code: USING Extractors.Csv(quoting : true, silent : false, encoding : Encoding.UTF8); – Anbarasan Dhanushkodi Jul 09 '18 at 21:29
  • How about using just [column name].Replace("\"","\"\"") in the SELECT for @output? Why does that still fail? Is it because the extractor is the one that fails? – dim_user Jul 16 '18 at 19:48
  • Hi @SaulCruz, I like the idea of simplifying the approach, but I can't see how that would work: it's the initial import (`EXTRACT`) which fails. – wBob Jul 16 '18 at 20:33
  • @wBob Makes total sense; your solution is a nice workaround, and thanks for that. However, some of my use cases are more complex: for example, we're merging tons of CSV files that have different schemas, and at the end we want one output with all the columns that ever existed in those files; where a column did not exist, the values should be NULL or empty. Any ideas? Thanks again. – dim_user Jul 18 '18 at 00:35
  • U-SQL supports `OUTER UNION`, which might help with your use case, but this sounds like a separate question. Consider raising a separate question with some sample data and expected results and I'm sure someone will help you @SaulCruz. – wBob Jul 18 '18 at 00:49
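
For reference, a minimal sketch of that OUTER UNION approach, assuming two rowsets @file1 and @file2 extracted from files with different schemas; the BY NAME ON (*) form matches columns by name and returns null for columns missing from either side:

// Merge rowsets with different schemas; columns absent from one
// side of the union come back as null in the combined result
@merged =
    SELECT * FROM @file1
    OUTER UNION BY NAME ON (*)
    SELECT * FROM @file2;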