I am using the following code to read a CSV file in PySpark:

cb_sdf = sqlContext.read.format("csv") \
                        .options(header='true', 
                                 multiLine = 'True', 
                                 inferschema='true', 
                                 treatEmptyValuesAsNulls='true') \
                        .load(cb_file)

The number of rows is correct, but for some rows the columns are split incorrectly. I think this is because the delimiter is ",", while some cells also contain ", " in their text.

For example, the following row in the pandas DataFrame (I used pd.read_csv to debug)

| Unnamed: 0 | name | domain | industry | locality | country | size_range |
|---|---|---|---|---|---|---|
| 111 | cjsc "transport, customs, tourism" | ttt-w.ru | package/freight delivery | vyborg, leningrad, russia | russia | 1 - 10 |

becomes

| _c0 | name | domain | industry | locality | country | size_range |
|---|---|---|---|---|---|---|
| 111 | "cjsc ""transport | customs | tourism""" | ttt-w.ru | package/freight delivery | vyborg, leningrad, russia |

when I read the same file with PySpark.

It seems the cell cjsc "transport, customs, tourism" is split into 3 cells: |"cjsc ""transport| customs| tourism"""|.

How can I set the delimiter to be exactly "," with no whitespace after it?

UPDATE:

I checked the CSV file. The original line is:

111,"cjsc ""transport, customs, tourism""",ttt-w.ru,package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10

So is this still a problem with the delimiter, or is it a problem with the quotes?

jf3440
  • Please post sample data as text, not as images; see [ask]. If a field in the CSV contains a comma, the field needs to be in quotes. If your CSV fields are not quoted, check with the producer of the broken output. – Robert Apr 21 '22 at 14:37
  • how about [trimming](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.functions.trim) those columns after reading? – pltc Apr 21 '22 at 17:05

1 Answer


I think that, split correctly, the line should give seven columns:

col1: 111
col2: "cjsc ""transport, customs, tourism"""
col3: ttt-w.ru
col4: package/freight delivery
col5: "vyborg, leningrad, russia"
col6: russia
col7: 1 - 10
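
One quick way to check the expected split is to parse the raw line from the question with Python's standard csv module, which by default treats a doubled quote inside a quoted field as a literal quote. This is only a sanity check on the data itself, not on Spark's behaviour; the variable names are made up for illustration.

```python
import csv
import io

# Raw line copied from the UPDATE section of the question.
raw_line = (
    '111,"cjsc ""transport, customs, tourism""",ttt-w.ru,'
    'package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10'
)

# csv.reader defaults: delimiter=',', quotechar='"', doublequote=True.
fields = next(csv.reader(io.StringIO(raw_line)))

# Print each parsed field with a 1-based column label.
for index, value in enumerate(fields, start=1):
    print(f"col{index}: {value}")
```

If this yields seven fields, the file itself is well-formed, which suggests the issue lies in how the quotes are interpreted rather than in the delimiter.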

Amanda
  • Use cb_sdf = sqlContext.read.format("csv").options(header='true', sep=',', multiLine='True', inferschema='true', treatEmptyValuesAsNulls='true').load(cb_file) – Amanda Oct 05 '22 at 21:09
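
Note that "," is already the default separator in Spark's CSV reader, so setting sep on its own may not change the result. The file escapes quotes by doubling them, whereas Spark's CSV reader uses a backslash as its default escape character, so a possible fix is to set the escape option to a double quote as well. A minimal sketch, assuming an existing SparkSession is passed in; the helper name read_quoted_csv and the CSV_OPTIONS dict are hypothetical and not part of the original code:

```python
# Reader options. quote and escape are both set to the double-quote
# character so that "" inside a quoted field is read as a literal quote.
CSV_OPTIONS = {
    "header": "true",
    "multiLine": "true",
    "inferSchema": "true",
    "sep": ",",
    "quote": '"',
    "escape": '"',
}


def read_quoted_csv(spark, path):
    """Hypothetical helper: load a CSV that escapes quotes by doubling them.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.read.options(**CSV_OPTIONS).csv(path)
```

With the names from the question, the call would look like cb_sdf = read_quoted_csv(spark, cb_file), after which the name column can be inspected to confirm the row with _c0 = 111 is no longer split.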