1

I have a csv file that comes from outside, so I am not able to change its definitions or formatting and I have to read it into a database.

The file has some things that look very odd for a csv format and I would like to know if there is a way of importing it using csv.reader or pandas without having to make some specific corrections using regex or similar.

The file uses pipe (|) as delimiter. Look at this two example lines:

1. '11|2000-02-18|Badjora|22|33|44|55|""|""Wood 6.5"" Long Line (Pine - Category:15 m | Humidity max: 13 g/m3)""|4'

2. '333|1999-02-18|Badjora|444|555|666|777|""|"""|4'

Is there a way of interpreting this into the following fields?

1. 11,2000-02-18,Badjora,22,33,44,55,,"Wood 6.5"" Long Line (Pine - Category:15 m | Humidity max: 13 g/m3)",4

2. 333,1999-02-18,Badjora,444,555,666,777,,",4

Something like "consider everything enclosed between quotes".

I tried to play with the different options of import settings but with no success. It seems that this format is not valid for csv.

Thank you very much in advance.

Regards to all

vascosp7
  • 23
  • 3

2 Answers2

0

Assuming your .csv file looks like this:

11|2000-02-18|Badjora|22|33|44|55|""|""Wood 6.5"" Long Line (Pine - Category:15 m | Humidity max: 13 g/m3)""|4
333|1999-02-18|Badjora|444|555|666|777|""|"""|4

You can read that file just fine with this:

import csv

with open("sample.csv") as sample:
    reader = csv.reader(sample, delimiter="|")
    for row in reader:
        print(row)

This nicely outputs:

['11', '2000-02-18', 'Badjora', '22', '33', '44', '55', '', 'Wood 6.5"" Long Line (Pine - Category:15 m ', ' Humidity max: 13 g/m3)""', '4']
['333', '1999-02-18', 'Badjora', '444', '555', '666', '777', '', '"|4\n']

If, however, you want to fix the delimiter, use this:

with open("sample.csv") as sample, open("fixed_sample.csv", "w", newline="") as outfile:
    reader = csv.reader(sample, delimiter="|")
    writer = csv.writer(outfile, delimiter=",")
    for item in reader:
        writer.writerow(item)
baduker
  • 19,152
  • 9
  • 33
  • 56
  • Thank you for your answer. I tried that, obviously, but it does not parse the fields as I required. For instance, in the last but one field, your solution splits the field into two fields. Your output has 11 fields in the first line and 9 in the second line and the number of fields should be exactly the same. – vascosp7 Feb 19 '21 at 11:55
  • Well, that's what you gave as an example, so I used what I had. You might have to pre-process the file before importing it or make sure the data comes in the right shape in the first place. – baduker Feb 19 '21 at 11:56
  • Yeah, but I also gave the desired output. If I could parse it easily as a regular csv the problem would not exist at all. But thank you very much for your answer anyway. :) – vascosp7 Feb 19 '21 at 12:01
0

Came up with this hacky regex delimiter with pandas that would work for your given example. I don't recommend doing this and advice to actually fix the csv using regex. Anyway here you go:

pd.read_csv('sample.csv', sep=r'"?(?<!\s)\|(?!\s)"?', engine='python', header=None)

There are still areas this could break for example a pipe without any spaces near it. Also an actual separator pipe having spaces before / after it. The main problem is your csv has unescaped quotes causing these problems.

Abdul Aziz Barkat
  • 19,475
  • 3
  • 20
  • 33
  • Thank you very much for your help, Abdul. Yeah, using regex was the solution I was trying to avoid, but it really seems my only chance. – vascosp7 Feb 19 '21 at 15:27