
This is not a duplicate of the question linked in the comments, as I do not want to drop the rows. The accepted answer in that post is very different from what I need and is not aimed at keeping all the data.

Problem: delimiter inside column data from a badly formatted CSV file

Tried solutions: the csv module, shlex, StringIO (no working solution found on SO)

Example data

Extra delimiters occur inside the third data field, which is (partially) enclosed by doubled double quotes:

08884624;6/4/2016;Network routing 21,5"" 4;8GHz1TB hddQwerty"";9999;resell:no;package:1;test
0085658;6/4/2016;Logic 111BLACK.compat: 29,46 cm (11.6"")deep: 4;06 cm height: 25;9 cm"";9999;resell:no;package:1;test
4235846;6/4/2016;Case Logic. compat: 39,624 cm (15.6"") deep: 3;05 cm height: 3 cm"";9999;resell:no;package:1;test
400015;6/4/2016;Cable""Easy Cover"""";1;5 m 30 Silver"";9999;resell:no;package:1;test
9791118;6/4/2016;Network routing 21,5"" (2013) 2;7GHz"";9999;resell:no;package:1;test
477000;6/4/2016;iGlaze. deep: 9,6 mm (67.378"") height: 14;13 cm"";9999;resell:no;package:1;test
4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88""). Speed 10;100 Mbit/s"";9999;resell:no;package:1;test

Desired sample output

Each row should have a fixed length of 7 fields:

['08884624', '6/4/2016', 'Network routing 21,5" 4,8GHz1TB hddQwerty', '9999', 'resell:no', 'package:1', 'test']

Parsing through the csv reader doesn't fix the problem (skipinitialspace is not the issue), shlex is of no use, and StringIO doesn't help either...

My initial idea was to import the file row by row and then replace the extra ';' element by element in each row. But the import itself is the problem, as it splits on every ';'.
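
To show what I mean, here is a minimal illustration of the split going wrong on the first sample row:

import csv

bad = '08884624;6/4/2016;Network routing 21,5"" 4;8GHz1TB hddQwerty"";9999;resell:no;package:1;test'
row = next(csv.reader([bad], delimiter=';'))
print(len(row))  # 8 fields instead of the expected 7: the description is split at its inner ';'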

The data comes from a larger file with 300,000+ rows (not all rows have this problem). Any advice is welcome.

  • Possible duplicate of [Pandas dataframe read_csv on bad data](http://stackoverflow.com/questions/33440805/pandas-dataframe-read-csv-on-bad-data) – philshem Apr 06 '17 at 09:57

2 Answers


Since you know the number of input fields, and only one field is badly formatted, you can simply split on ';' and then join the middle pieces back into a single field:

for line in file:
    temp_l = line.rstrip('\n').split(';')
    # lst should contain the expected 7 fields: the middle pieces are joined back into one
    lst = temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:]
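
Applied to the whole file, a rough sketch could look like this (the input file name and collecting into a list are just placeholders):

fixed_rows = []
with open('input.csv') as file:
    for line in file:
        temp_l = line.rstrip('\n').split(';')
        # already-correct rows keep their 7 fields; bad rows are joined back together
        fixed_rows.append(temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:])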

I did not even try to process the double quotes, because I could not understand how you get from Network routing 21,5"" 4;8GHz1TB hddQwerty"" to 'Network routing 21,5" 4,8GHz1TB hddQwerty'...

Serge Ballesta

You can use the standard csv module.

To achieve what you are trying to accomplish, just change the csv delimiter to ';'.

Test the following in the terminal:

import csv

test = ["4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test"]

delimited_colon = list(csv.reader(test, delimiter=";", skipinitialspace=True))
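
Applied to the whole file rather than a single test string, it would look roughly like this (the file name is just a placeholder):

import csv

with open('data.csv', newline='') as f:
    rows = list(csv.reader(f, delimiter=";", skipinitialspace=True))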
Ilhicas