remove unwanted quotes and comma in csv file

Question

I need to remove unwanted quotes and commas from a csv file. Sample data as below

header1, header2, header3, header4
1,       "ABC",   BCD,     "EDG",GHT\2\TST"

The last column has some free text values which seems like a new column but it opend in excel then it look like this

EDG",GHT\2\TST

Please guide me in fixing this last column.

Tried this -

sed 's/","/|/g' $filename | sed 's/|",/||/g' | sed 's/|,"/|/g' | sed 's/",/ /g' | sed 's/^.//' | awk '{print substr($0, 1, length($0)-1)}' | sed 's/,/ /g' | sed 's/"/ /g' | sed 's/|/,/g' > "out_"$filename

Can you please share what you want the last column to look like? — Aditya Patel, May 02 '19 at 15:08
The last column value should look like EDG GHT\2\TST after removing quote and comma so that I can load it in Hive — Arvind Singh Kamlakar, May 02 '19 at 15:14
I assumed that this was coming from a pandas dataframe, but perhaps its not. Is this just a file? I think you can tell excel to use tab delimiter rather than comma delimiter and it should read it properly? — erncyp, May 02 '19 at 15:25

score 0 · Answer 1 · answered May 02 '19 at 15:18

0

this should find " or , from columns and replace it with nothing

df = df.str.replace('[",]','',regex=True)

answered May 02 '19 at 15:18

erncyp

1,649
21
23

But this can replace valid ", from column header 2 value "ABC", which I don't want I only want this to be replace from any value which is already within "" like "EDG",GHT\2\TST" and not for "ABC", – Arvind Singh Kamlakar May 02 '19 at 15:23
Is it just a matter of getting the correct regex then? Im not that good at regex, so this isn't that straightforward to me. – erncyp May 02 '19 at 15:38
does this regex work? `"("|,)[.]*` – erncyp May 02 '19 at 15:45
Able to do by used sed and awk combination as below ``` sed 's/","/|/g' $filename | sed 's/|",/||/g' | sed 's/|,"/|/g' | sed 's/",/ /g' | sed 's/^.//' | awk '{print substr($0, 1, length($0)-1)}' | sed 's/,/ /g' | sed 's/"/ /g' | sed 's/|/,/g' > "out_"$filename ``` – Arvind Singh Kamlakar May 10 '19 at 18:52

Théo B. · Answer 2 · 2019-05-02T15:30:36.700

0

You can do it like this :

with open("data.txt", "r") as f:
        for line in f.readlines():
                columns = line.split(", ") # Split by ", "
                columns[3] = "".join(columns[3:]) # Merge columns 4 to ... last
                columns[3] = columns[3].replace("\"", "").replace(",", "")` # Removing unwanted characters
                del columns[4:] # Remove all unnecessary columns

                print("%s | %s | %s | %s" % (columns[0], columns[1], columns[2], columns[3]))

My data.txt file :

1, "ABC", BCD, "EDG",GHT\2\TST"
2, "CBA", DCB, "GDV",DHZ,\2RS"

Output :

1 | "ABC" | BCD | EDGGHT\2\TST

2 | "CBA" | DCB | GDVDHZ\2RS

This solution will works if only last column contains commas.

edited May 02 '19 at 15:30

answered May 02 '19 at 15:24

Théo B.

13
4

Thanks Theo for help but such values can come in any column in the csv file – Arvind Singh Kamlakar May 02 '19 at 15:32
So you can't determine which column is which column ... CSV file use commas for separating columns. Maybe you can add escape for comma before exporting the data ? – Théo B. May 02 '19 at 15:38
That's correct. When I open the csv in MS Excel then it open it correctly and able to show me which column is which, but when I load this csv in Hive external table using opencsv serde then values get shifted due to unwanted ", combination. – Arvind Singh Kamlakar May 02 '19 at 15:42
Some systems use `, ` ( comma and space ) for splitting columns, is always your case ? – Théo B. May 03 '19 at 08:48
Its separated by comma only. – Arvind Singh Kamlakar May 03 '19 at 13:08
Able to do by used sed and awk combination as below ``` sed 's/","/|/g' $filename | sed 's/|",/||/g' | sed 's/|,"/|/g' | sed 's/",/ /g' | sed 's/^.//' | awk '{print substr($0, 1, length($0)-1)}' | sed 's/,/ /g' | sed 's/"/ /g' | sed 's/|/,/g' > "out_"$filename ``` – Arvind Singh Kamlakar May 10 '19 at 18:52

remove unwanted quotes and comma in csv file

2 Answers2