
This is not a duplicate of the question linked in the comments, as I do not want to drop the rows. The accepted answer in that post is very different from what I need and is not aimed at keeping all the data.

Problem: delimiter inside column data from a badly formatted CSV file

Tried solutions: the csv module, shlex, StringIO (no working solution found on SO)

Example data

Extra delimiters occur inside the third data field, which is (partially) enclosed by doubled double quotes:

08884624;6/4/2016;Network routing 21,5"" 4;8GHz1TB hddQwerty"";9999;resell:no;package:1;test
0085658;6/4/2016;Logic 111BLACK.compat: 29,46 cm (11.6"")deep: 4;06 cm height: 25;9 cm"";9999;resell:no;package:1;test
4235846;6/4/2016;Case Logic. compat: 39,624 cm (15.6"") deep: 3;05 cm height: 3 cm"";9999;resell:no;package:1;test
400015;6/4/2016;Cable""Easy Cover"""";1;5 m 30 Silver"";9999;resell:no;package:1;test
9791118;6/4/2016;Network routing 21,5"" (2013) 2;7GHz"";9999;resell:no;package:1;test
477000;6/4/2016;iGlaze. deep: 9,6 mm (67.378"") height: 14;13 cm"";9999;resell:no;package:1;test
4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88""). Speed 10;100 Mbit/s"";9999;resell:no;package:1;test

Desired sample output

Each row should have a fixed length of 7 fields:

['08884624', '6/4/2016', 'Network routing 21,5" 4,8GHz1TB hddQwerty', '9999', 'resell:no', 'package:1', 'test']

Parsing through the csv reader doesn't fix the problem (skipinitialspace is not the issue), shlex is of no use, and StringIO doesn't help either...

My initial idea was to import the file row by row and then replace the extra ';' element by element in each row. But the import itself is the problem, as it splits on every ';'.
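
To show what I mean, here is a minimal illustration of the split going wrong on the first sample row:

import csv

bad = '08884624;6/4/2016;Network routing 21,5"" 4;8GHz1TB hddQwerty"";9999;resell:no;package:1;test'
row = next(csv.reader([bad], delimiter=';'))
print(len(row))  # 8 fields instead of the expected 7: the description is split at its inner ';'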

The data comes from a larger file with 300,000+ rows (not all rows have this problem). Any advice is welcome.

  • Possible duplicate of [Pandas dataframe read_csv on bad data](http://stackoverflow.com/questions/33440805/pandas-dataframe-read-csv-on-bad-data) – philshem Apr 06 '17 at 09:57

2 Answers


Since you know the number of input fields, and only one field is badly formatted, you can simply split on ';' and then join the middle pieces back into a single field:

for line in file:
    temp_l = line.rstrip('\n').split(';')
    # lst should contain the expected 7 fields: the middle pieces are joined back into one
    lst = temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:]
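
Applied to the whole file, a rough sketch could look like this (the input file name and collecting into a list are just placeholders):

fixed_rows = []
with open('input.csv') as file:
    for line in file:
        temp_l = line.rstrip('\n').split(';')
        # already-correct rows keep their 7 fields; bad rows are joined back together
        fixed_rows.append(temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:])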

I did not even try to process the double quotes, because I could not understand how you get from Network routing 21,5"" 4;8GHz1TB hddQwerty"" to 'Network routing 21,5" 4,8GHz1TB hddQwerty'...

Serge Ballesta

You can use the standard csv module.

To achieve what you are trying to accomplish, just change the csv delimiter to ';'.

Test the following in the terminal:

import csv

test = ["4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test"]

delimited_colon = list(csv.reader(test, delimiter=";", skipinitialspace=True))
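
Applied to the whole file rather than a single test string, it would look roughly like this (the file name is just a placeholder):

import csv

with open('data.csv', newline='') as f:
    rows = list(csv.reader(f, delimiter=";", skipinitialspace=True))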
Ilhicas