
I have a stack of CSV files I want to parse - the problem is that half of them have quote marks used as ordinary quotation marks, and commas, inside the main field. They are not really CSV, but they do have a fixed number of identifiable fields. The `dialect=csv.excel` setting works perfectly on files without the extra `"` and `,` characters inside the field.

This data is old/unsupported. I am trying to push some life into it.

e.g.

"AAAAA
AAAA
AAAA
AAAA","AAAAAAAA


AAAAAA
AAAAA "AAAAAA" AAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAA, AAAAA
AAAAAAAAA AAAAA AAAAAAAAAA
AAAAA, "AAAAA", AAAAAAAAA
AAAAAAAA AAAAAAAA
AAAAAAA
"

This is tripping up the file parser, which throws `_csv.Error: newline inside string`. I narrowed it down to this being the issue by removing the quote marks from inside the 2nd field, after which the csv.reader module parses the file OK.

Some of the fields are multi-line - I'm not sure if that's important to know.

I have been poking around at the dialect settings, and whilst I can find 'skipinitialspace', this doesn't seem to solve the problem.

To be clear - this is not valid 'CSV'; it's data objects that loosely follow a CSV structure, but have `,` and `"` characters inside the field text.

The lineterminator is `\x0d\x0a`.

I have had a number of goes at different permutations of `doublequote` and the `quoting` variable in the dialect module, but I can't get this to parse correctly.

I cannot be confident that a `,"` or `",` combination exists only on field boundaries.

This problem only exists for one (the last) of several fields in the file, and there are several thousand files.

Jay Gattuso
  • Proper CSV uses two consecutive quote characters (`""`) to escape a quote character in a quoted field. Alternately, the field could be unquoted, and no escape is necessary, although commas and newlines cannot appear in such a field. Single quotes like this are not normally valid. Is it possible to modify the options of the source of your CSV data to better conform to the expectations of most other CSV readers (including Python)? – SingleNegationElimination Feb 10 '12 at 23:09
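The escaping rule described in that comment can be seen directly with the standard library (the sample line here is made up):

```python
import csv
import io

# A well-formed CSV line: the embedded quotes around "hi" are escaped
# by doubling them, so the comma inside the field is protected too.
data = '"field one","he said ""hi"", then left"\r\n'

rows = list(csv.reader(io.StringIO(data)))
print(rows)   # [['field one', 'he said "hi", then left']]
```

The doubled `""` comes back as a single literal quote, and the embedded comma stays inside the second field.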

3 Answers


Have you tried passing csv.QUOTE_NONE via the quoting keyword arg? Without having some code or data to test this on, I have no way to know whether this actually works on your data, but it seems to work with the fragment you provided.

>>> import csv
>>> r = csv.reader(open('foo.csv', 'rb'), quoting=csv.QUOTE_NONE)
>>> for row in r: print row
... 
['"A"', '"B"', '"ccc "ccccccc" cccccc"']
senderle
  • I did try that, but that spits out the whole file broken up by whitespace and other junk. There are only 7 items in the csv list; that method gives hundreds. – Jay Gattuso Feb 10 '12 at 23:14
  • @Jay Gattuso, in that case, you should really post some more example data. What delimiter does the file use? If it has quotes and commas scattered throughout the data, then it must be using a different delimiter. Or else it's just not a well-formed csv file. – senderle Feb 10 '12 at 23:17
  • The delimiter is a standard comma; it's simply not well formed. I am starting to think I might have to pre-parse it, looking for the right conditions and replacing the quotes there with doubles... I did think that's what the dialect module was for. I would post more examples, but it's not shareable data. The problem is that it's bad CSV, not that it's 'special'. – Jay Gattuso Feb 10 '12 at 23:20
  • "whole file broken up by white space"? How can that happen? Do you have random commas all over the file, also? – S.Lott Feb 10 '12 at 23:21
  • Nope, but the `QUOTE_NONE` setting results in whitespace being used as the delimiter. I tried setting the delimiter in the same csv.reader call to ',' but it resulted in the same list being created. As we have established, this method is not going to work for this poorly formed 'CSV' data. – Jay Gattuso Feb 10 '12 at 23:25

I'm not allowed to comment quite yet, so I'll post as an answer...

Assuming you are using commas as your delimiter, are there any commas within your data? If not, then you could do a massive find and replace to double all the quote characters after the first and before the last characters of the field prior to CSV processing.
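Under that no-commas-in-the-data assumption, the find-and-replace could be sketched roughly like this (`requote_record` is a name made up for illustration):

```python
import csv
import io

def requote_record(record):
    """Double every quote that is not a field's opening or closing quote.

    Assumes commas appear only as field delimiters (the premise above),
    so a plain split on ',' recovers the field boundaries.
    """
    fixed = []
    for field in record.split(','):
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            # keep the outer quotes, double the interior ones
            fixed.append('"' + field[1:-1].replace('"', '""') + '"')
        else:
            fixed.append(field)
    return ','.join(fixed)

raw = '"AAAAA","AAAAA "BBBB" CCCC"'
repaired = requote_record(raw)
print(repaired)   # "AAAAA","AAAAA ""BBBB"" CCCC"
row = next(csv.reader(io.StringIO(repaired)))
print(row)        # ['AAAAA', 'AAAAA "BBBB" CCCC']
```

After the repair, the record parses cleanly with the default `excel` dialect.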

Herbie
  • Sadly there are commas in the field text too, but I think I'll have to scrap the csv module for parsing - I can't trust that I won't find either a `,"` or a `",` combination in the field text, which kind of busts the whole csv idea... – Jay Gattuso Feb 10 '12 at 23:31
  • Well, it was a thought. Within a certain data field for a database here at my work, I prohibited the use of commas. I was able to get away with it, though, only because there were only 100 records affected in a 50,000 row file. – Herbie Feb 10 '12 at 23:36
  • It's a nice idea, but I have several thousand files, each pretty much unknown - wait - I can kind of do it that way - I know roughly where the legit `"` characters are in the file - I can find them, replace them with doubles, and hopefully make a valid csv that way. – Jay Gattuso Feb 10 '12 at 23:40
  • How many fields are affected by data with commas and quote chars? If it's only 1, the problem is still solvable. – Herbie Feb 10 '12 at 23:47
  • One field, but this field can be a few hundred lines of text. – Jay Gattuso Feb 10 '12 at 23:50
  • Split each record on all the commas and ignore the quote chars. You have a known number of fields on the left of your troublesome field and a known number of fields on the right. You can then put the remaining pieces back together and double all the quote chars at the same time. You should then have a valid CSV file. – Herbie Feb 11 '12 at 03:00
  • Ahh - cool, excellent, thanks - I don't need the data in CSV anyway; I'm moving it into a db, so the reshaping part is not needed. I didn't think to split the file; I was fixated on the CSV aspect. Thank you. – Jay Gattuso Feb 11 '12 at 03:17
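A sketch of the splitting approach from those comments, assuming the messy field is the last of a known number of fields (the question mentions the last of several; the helper name and the 3-field sample are illustrative):

```python
import csv
import io

def repair_record(record, n_fields=7):
    """Rebuild a record whose LAST field may contain stray commas/quotes.

    Splits on the first n_fields - 1 commas, so everything after them
    belongs to the messy final field; that field's interior quotes are
    then doubled to make the record valid CSV.
    """
    parts = record.split(',', n_fields - 1)
    clean, messy = parts[:-1], parts[-1].strip()
    body = messy[1:-1].replace('"', '""')   # drop outer quotes, double the rest
    return ','.join(clean + ['"' + body + '"'])

# Illustrative 3-field record; the files in the question have more fields.
fixed = repair_record('"A","B","c "q" d, e"', n_fields=3)
print(next(csv.reader(io.StringIO(fixed))))   # ['A', 'B', 'c "q" d, e']
```

If the troublesome field sat in the middle instead, the same idea works by also counting fields from the right, as the comment describes.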

I would write a converter that parses the initial csv and outputs a valid one. You could possibly use the `",` or `"\n` sequences as the mechanism for determining the field boundaries.
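A rough sketch of that converter idea, treating `","` as the field boundary - with the caveat raised in the comment below that this assumption misfires if `","` can occur inside the field text:

```python
import re

def split_on_quote_comma_quote(record):
    """Split one record by treating '","' as the field boundary.

    Unsafe if '","' can occur inside the field text itself, as the
    follow-up comment points out; shown only as a sketch.
    """
    inner = record.strip()[1:-1]        # drop the record's outer quotes
    return re.split(r'",\s*"', inner)

print(split_on_quote_comma_quote('"A","B "x" y","C"'))   # ['A', 'B "x" y', 'C']
```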

Joshua
  • That's tempting, but the valid field markers aren't always on `\n` boundaries, and the "`",` is a valid CSV marker" assumption would be dangerous. – Jay Gattuso Feb 11 '12 at 00:24