Ambiguity in parsing csv file

Question

I am trying to parse a csv file with the following contents:

# country,title1,title2,type
GB,Fast Friends,Burn Notice, S:4, E:2,episode,
SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,

The expected output is:

['SE', 'The Spiderwick Chronicles', '"SPIDERWICK CHRONICLES, THE"', 'movie']
['GB', 'Fast Friends', 'Burn Notice, S:4, E:2', 'episode']

The problem is that the commas in the title fields are not escaped. I tried csv.reader as well as plain string and regex parsing, but could not get unambiguous matches.

Is it possible at all to parse this file accurately when some of the fields contain unescaped commas? Or does a new CSV need to be created?

Possible duplicate of [Python: read CSV file with comma within fields](http://stackoverflow.com/questions/8311900/python-read-csv-file-with-comma-within-fields) - Ankit Jaiswal, Mar 05 '15 at 04:41
@AnkitJaiswal That's not a duplicate; in that question the items are enclosed in quotation marks. - David542, Mar 05 '15 at 04:42
What's the expected output for `GB,Fast,Friends,four,Burn Notice, S:4, E:2,episode,`? How do I tell which part belongs to title1 and which part belongs to title2? - Avinash Raj, Mar 05 '15 at 04:44
@David542 If a value that contains commas is not enclosed in quotes, it will be treated as separate cells. - Ankit Jaiswal, Mar 05 '15 at 04:49

score 2 · Accepted Answer · answered Mar 05 '15 at 04:47

If you can assume that every stray comma appears in title2, you can play a trick: split on commas, take the fixed fields from both ends, and rejoin whatever is left in the middle. Otherwise, you have ambiguous data.

strings = [
    'SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,',
    'GB,Fast Friends,Burn Notice, S:4, E:2,episode,',
]
for string in strings:
    xs = string.split(',')
    country = xs[0]
    title1 = xs[1]
    # Everything between title1 and type is title2; rejoin with the commas
    # that split() removed. The last element is empty (trailing comma).
    title2 = ','.join(xs[2:-2])
    mtype = xs[-2]
    print([country, title1, title2, mtype])

Output:

['SE', 'The Spiderwick Chronicles', '"SPIDERWICK CHRONICLES, THE"', 'movie']
['GB', 'Fast Friends', 'Burn Notice, S:4, E:2', 'episode']
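
The same idea can be sketched as a reusable function. This is only an illustration under the assumption stated above (unescaped commas occur in title2 only, and every row ends with a trailing comma as in the sample); the names parse_line and parse_lines are made up for this example.

```python
def parse_line(line):
    """Split one row, assuming only title2 may contain unescaped commas."""
    # Drop the newline and the trailing comma that ends each sample row.
    line = line.rstrip('\n').rstrip(',')
    # The first two commas delimit country and title1.
    country, title1, rest = line.split(',', 2)
    # The last comma delimits type; everything before it is title2.
    title2, mtype = rest.rsplit(',', 1)
    return [country, title1, title2, mtype]


def parse_lines(lines):
    """Parse an iterable of rows, skipping blank lines and '#' comment lines."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue
        rows.append(parse_line(line))
    return rows


sample = [
    '# country,title1,title2,type\n',
    'GB,Fast Friends,Burn Notice, S:4, E:2,episode,\n',
    'SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,\n',
]
for row in parse_lines(sample):
    print(row)
```

Limiting the number of splits from each end avoids having to rejoin the middle pieces. If a comma could also appear in title1, no split strategy can recover the fields, and the file would have to be regenerated with proper quoting.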

score 0 · Answer 2 · answered Mar 05 '15 at 05:29

You can use a regular expression (Python's re module, via import re); see the documentation.

Match on (\".*\",)|(.*,)
This way you look for either [quoted string,] or [any string,].
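
A rough sketch of the quoted-or-unquoted idea is below. Note that it uses a variation of the pattern above, not the exact one: the `.*` parts are replaced with character classes so that a match cannot run past the next quote or comma. The name FIELD is made up for this example. It also shows the limit of the approach: the quoted row comes out as four fields, while the unquoted row is still split at every comma.

```python
import re

# Either a double-quoted run (which may contain commas) or a run without commas.
# Caveat: empty fields are skipped because [^,]+ needs at least one character.
FIELD = re.compile(r'"[^"]*"|[^,]+')

quoted = 'SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,'
unquoted = 'GB,Fast Friends,Burn Notice, S:4, E:2,episode,'

quoted_fields = FIELD.findall(quoted)
unquoted_fields = FIELD.findall(unquoted)

print(quoted_fields)    # four fields; the quoted title stays in one piece
print(unquoted_fields)  # six fields; the unquoted title is still split
```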

alonre


score 0 · Answer 3 · answered Mar 05 '15 at 06:26

If there are commas in the fields, I would save the Excel sheet as a text file with the fields separated by tabs instead of commas.
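
Assuming the data can be re-exported that way, reading it back is straightforward with the csv module and a tab delimiter. The tab-separated text below is a hypothetical re-export of the two sample rows, held in memory so the example runs on its own; with a real file you would pass the open file object to csv.reader instead.

```python
import csv
import io

# Hypothetical tab-separated export of the sample rows (not the original file).
tsv_text = (
    'GB\tFast Friends\tBurn Notice, S:4, E:2\tepisode\n'
    'SE\tThe Spiderwick Chronicles\tSPIDERWICK CHRONICLES, THE\tmovie\n'
)

# With tabs as the delimiter, commas inside a field are ordinary characters.
reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')
rows = list(reader)

for row in rows:
    print(row)
```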

James Bear
