0

I am trying to parse a csv file with the following contents:

# country,title1,title2,type
GB,Fast Friends,Burn Notice, S:4, E:2,episode,
SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,

The expected output is:

['SE', 'The Spiderwick Chronicles', '"SPIDERWICK CHRONICLES, THE"', 'movie']
['GB', 'Fast Friends', 'Burn Notice, S:4, E:2', 'episode']

The problem is that the commas in the 'title' fields are not escaped. I tried csv.reader as well as string and regex parsing, but could not get unambiguous matches.

Is it possible to parse this file accurately when half of the fields may contain unescaped commas, or does a new CSV need to be created?

David542

3 Answers

2

You may be able to play a trick if you can assume that every unescaped comma appears in title2. Otherwise, the data is ambiguous.

strings = ['SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,',
           'GB,Fast Friends,Burn Notice, S:4, E:2,episode,']
for string in strings:
    xs = string.split(',')
    country = xs[0]
    title1  = xs[1]
    # Rejoin with ',' so the commas inside title2 are preserved.
    title2  = ','.join(xs[2:-2])
    # Each line ends with a trailing comma, so xs[-1] is '' and the type is xs[-2].
    mtype   = xs[-2]
    print([country, title1, title2, mtype])

Output:

['SE', 'The Spiderwick Chronicles', '"SPIDERWICK CHRONICLES, THE"', 'movie']
['GB', 'Fast Friends', 'Burn Notice, S:4, E:2', 'episode']
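
The same assumption can also be combined with the csv module, so that properly quoted fields are handled by the parser and only the surplus fields are merged back into title2. This is only a sketch: parse_line and n_fields are names made up for illustration, and it assumes every line ends with a trailing comma and that any extra commas belong to title2. Note that csv.reader strips the surrounding quotes, so the quoted title comes back without them, unlike the expected output in the question.

```python
import csv

def parse_line(line, n_fields=4):
    """Parse one line, assuming any surplus commas belong to title2 (index 2)."""
    # csv.reader takes care of the properly quoted case.
    row = next(csv.reader([line]))
    # The trailing comma on each line produces one empty last field; drop it.
    if row and row[-1] == '':
        row = row[:-1]
    extra = len(row) - n_fields
    if extra < 0:
        raise ValueError('too few fields: %r' % line)
    # Merge the surplus fields back into title2, restoring the commas.
    title2 = ','.join(row[2:3 + extra])
    return [row[0], row[1], title2, row[-1]]

for line in ['SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,',
             'GB,Fast Friends,Burn Notice, S:4, E:2,episode,']:
    print(parse_line(line))
```

If title1 can also contain unescaped commas, this approach cannot tell the two titles apart and the data remains ambiguous.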
0

You can use a regular expression (import re); see the documentation.

Match for (\".*\",)|(.*,)
This way you are looking for either a quoted string followed by a comma or any string followed by a comma.
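
Because .* is greedy, the alternation above tends to match the whole line in one go. A possible variant, sketched here under the assumption that only title2 contains extra commas and that every line ends with a trailing comma, anchors the three comma-free fields and lets title2 absorb the rest. LINE_RE and parse_with_regex are illustrative names, not part of any library.

```python
import re

# country, title1 and type may not contain commas; title2 takes whatever is left.
LINE_RE = re.compile(r'^([^,]*),([^,]*),(.*),([^,]*),$')

def parse_with_regex(line):
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError('line does not fit the assumed layout: %r' % line)
    return list(m.groups())

for line in ['SE,The Spiderwick Chronicles,"SPIDERWICK CHRONICLES, THE",movie,',
             'GB,Fast Friends,Burn Notice, S:4, E:2,episode,']:
    print(parse_with_regex(line))
```

Unlike csv.reader, this keeps the double quotes around the quoted title, which matches the expected output in the question.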

alonre
0

If there are commas in the fields, I would save the Excel sheet as a text file with the fields separated by tabs.
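
Reading such a file back is then straightforward with csv.reader and delimiter='\t'. A minimal sketch, assuming the data can be re-exported from its source; the tab-separated text below is a made-up stand-in for that export, not the original file.

```python
import csv
import io

# Hypothetical tab-separated export of the same two rows.
tsv_text = (
    "GB\tFast Friends\tBurn Notice, S:4, E:2\tepisode\n"
    "SE\tThe Spiderwick Chronicles\tSPIDERWICK CHRONICLES, THE\tmovie\n"
)

# With tabs as the delimiter, commas inside a field need no escaping.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter='\t'))
for row in rows:
    print(row)
```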

James Bear