-1

I've got a csv file, where the first elements in each row contain sometimes an additional comma as part of the name. Since not all elements are put in quotation marks it will result in a messed up dataframe. I want to regard only the last n commas as commas, additional in the beginning should be deleted.

Example:
"name", "rank", "wealth"
Donald Trump, Jr, "45", "3"
Barack Obama, 44, 0

In this example, meaning dropping the first comma only in rows with more than 2 commas.

  • Please provide enough code so others can better understand or reproduce the problem. – Community Jul 22 '22 at 16:01
  • Not certain but I doubt if you can do this with pandas. You'll probably need to pre-process the file yourself. Show an example of the CSV and the code you've tried – DarkKnight Jul 22 '22 at 16:02
  • It sounds like this is not a properly formatted CSV. If you have any control over the thing that writes it, fix that. – tdelaney Jul 22 '22 at 16:04
  • I added an example. The problem is I have got a very large CSV that is messed up already. – dustsucker Jul 22 '22 at 16:08

2 Answers2

0

On the basis of the same data shown, you could do this. However, if your input file is very badly messed up (i.e., your sample is not truly representative) then this may not work.

Note:

The input file for this code is a copy/paste of the sample given in the question

import pandas as pd
import io

MAXCOMMAS = 2

iob = io.StringIO()

with open('csv.csv') as data_in:
    for line in data_in:
        if (commas := line.count(',')) > MAXCOMMAS:
            line = line.replace(',', '', commas-MAXCOMMAS)
        iob.write(line.replace('"', ''))

iob.seek(0)
df = pd.read_csv(iob)
print(df)

Output:

              name   rank   wealth
0  Donald Trump Jr     45        3
1     Barack Obama     44        0
DarkKnight
  • 19,739
  • 3
  • 6
  • 22
0

You could scrub the file with the csv module before using pandas. The rule here is that if a row has more than 21 columns, all of the extra columns on the left (for left-to-right readers) are collapsed into a single comma-separated cell that is properly quoted by the csv module on write. The csv module is used so that anything that is already properly quoted or escaped isn't messed up in the process.

with open("test.csv", newline="") as infile, open("test-fixed.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        if len(row) > 21:
            n = len(row) - 21 + 1
            row[:n] = [",".join(row[:n])]
            print(n, len(row))
        writer.writerow(row)
tdelaney
  • 73,364
  • 6
  • 83
  • 116