I have a stack of CSV files I want to parse - the problem is that half of them have quote marks used as ordinary quote marks, and commas, inside the main field. They are not really CSV, but they do have a fixed number of fields that are identifiable. The dialect="excel" setting works perfectly on files without the extra " and , chars inside the field.
This data is old/unsupported. I am trying to push some life into it.
e.g.
"AAAAA
AAAA
AAAA
AAAA","AAAAAAAA
AAAAAA
AAAAA "AAAAAA" AAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAA, AAAAA
AAAAAAAAA AAAAA AAAAAAAAAA
AAAAA, "AAAAA", AAAAAAAAA
AAAAAAAA AAAAAAAA
AAAAAAA
"
This is tripping the file parser, which throws the error _csv.Error: newline inside string. I narrowed it down to this being the issue by removing the quote marks from inside the 2nd field, after which csv.reader parses the file OK.
Some of the fields are multi-line - I'm not sure if that's important to know.
I have been poking around at the dialect settings, and whilst I can find 'skipinitialspace', this doesn't seem to solve the problem.
To be clear - this is not valid 'CSV'; it's data objects that loosely follow a CSV structure, but have , and " chars inside the field text.
The lineterminator is \x0d\x0a
I have tried a number of different permutations of the doublequote and quoting settings in the dialect, but I can't get this to parse correctly.
I cannot be confident that a ," or ", combination exists only on field boundaries.
This problem only exists for one (the last) of several fields in the file, and there are several thousand files.
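Since only the last field is messy and the field count is fixed, one direction I can imagine is to skip csv.reader for these files: parse the leading fields as normal quoted fields, then treat everything that remains as the last field. Below is a rough sketch of that idea, not a tested solution. It assumes one record per file, that every field is wrapped in quote marks, and that the leading fields are clean. The name split_record and the n_fields parameter are made up for illustration.

```python
import re

# A well-formed quoted field followed by a comma.
# Inside the quotes, "" stands for one literal quote mark.
CLEAN_FIELD = re.compile(r'"((?:[^"]|"")*)",')


def split_record(text, n_fields):
    """Split one record into n_fields, tolerating stray " and , in the last field.

    Assumption: the first n_fields - 1 fields are clean quoted fields and
    only the final field contains unescaped quote marks or commas.
    """
    fields = []
    pos = 0
    for _ in range(n_fields - 1):
        match = CLEAN_FIELD.match(text, pos)
        if match is None:
            raise ValueError("expected a clean quoted field at offset {}".format(pos))
        fields.append(match.group(1).replace('""', '"'))
        pos = match.end()

    # Whatever is left is the last field: drop the record's trailing
    # line terminator, then the opening and closing quote marks.
    last = text[pos:].rstrip("\r\n")
    if len(last) < 2 or last[0] != '"' or last[-1] != '"':
        raise ValueError("last field is not wrapped in quote marks")
    fields.append(last[1:-1])
    return fields
```

To keep the \x0d\x0a terminators inside the multi-line fields, the file would need to be read without newline translation (for example open(path, newline="") on Python 3) and passed in as a single string. If a file can hold more than one record, this sketch is not enough, because the end of the messy field can no longer be taken as the end of the text.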