I am using the following code to read a CSV file in PySpark:
cb_sdf = sqlContext.read.format("csv") \
    .options(header='true',
             multiLine='true',
             inferSchema='true',
             treatEmptyValuesAsNulls='true') \
    .load(cb_file)
The row count is correct, but for some rows the columns are split incorrectly. I suspect this is because the delimiter is ",", while some cells also contain ", " inside their text.
For example, the following row in the pandas dataframe (I used pd.read_csv to debug):
Unnamed: 0 | name | domain | industry | locality | country | size_range |
---|---|---|---|---|---|---|
111 | cjsc "transport, customs, tourism" | ttt-w.ru | package/freight delivery | vyborg, leningrad, russia | russia | 1 - 10 |
becomes
_c0 | name | domain | industry | locality | country | size_range |
---|---|---|---|---|---|---|
111 | "cjsc ""transport | customs | tourism""" | ttt-w.ru | package/freight delivery | vyborg, leningrad, russia |
when I read it with PySpark.
It seems the cell "cjsc ""transport, customs, tourism""" is split into 3 cells: |"cjsc ""transport| customs| tourism"""|.
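For what it's worth, the broken output matches what a plain comma split (one that ignores quoting) would produce on that field, while a quote-aware parse keeps the field whole. A small sketch, using the cell value copied from the row above:

```python
import csv
import io

# The raw field as it appears in the file (quotes doubled inside a quoted field)
cell = '"cjsc ""transport, customs, tourism"""'

# Naive split on every comma reproduces the 3-cell breakup shown above
print(cell.split(","))
# ['"cjsc ""transport', ' customs', ' tourism"""']

# A quote-aware CSV parse keeps the field intact
print(next(csv.reader(io.StringIO(cell))))
# ['cjsc "transport, customs, tourism"']
```

This suggests the reader is splitting on commas without honoring the quotes, rather than the delimiter itself being wrong.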
How can I set the delimiter to be exactly "," with no following whitespace?
UPDATE:
I checked the CSV file, the original line is:
111,"cjsc ""transport, customs, tourism""",ttt-w.ru,package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10
So is it still a delimiter problem, or is it a problem with the quotes?
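One quick check: Python's csv module follows the doubled-quote convention (RFC 4180) that this line uses, and it parses the line into the expected 7 fields. A minimal sketch, using the original line quoted above:

```python
import csv
import io

# The original line from the CSV file
line = ('111,"cjsc ""transport, customs, tourism""",ttt-w.ru,'
        'package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10')

fields = next(csv.reader(io.StringIO(line)))
print(len(fields))  # 7
print(fields[1])    # cjsc "transport, customs, tourism"
print(fields[4])    # vyborg, leningrad, russia
```

Since a quote-aware parser handles the line cleanly, the file's quoting looks valid, which points at the quote handling of the reader rather than the delimiter; one thing worth experimenting with (an assumption on my part, not verified here) is Spark's quote/escape options, since its CSV reader defaults to backslash as the escape character rather than the doubled-quote convention.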