I am using the following code to read a CSV file in PySpark:
cb_sdf = sqlContext.read.format("csv") \
    .options(header='true',
             multiLine='true',
             inferSchema='true',
             treatEmptyValuesAsNulls='true') \
    .load(cb_file)
The row count is correct, but for some rows the columns are split incorrectly. I suspect this is because the delimiter is ",", while some cells also contain ", " inside their text.
For example, the following row in the pandas dataframe (I used pd.read_csv to debug):
Unnamed: 0 | name | domain | industry | locality | country | size_range |
---|---|---|---|---|---|---|
111 | cjsc "transport, customs, tourism" | ttt-w.ru | package/freight delivery | vyborg, leningrad, russia | russia | 1 - 10 |
becomes
_c0 | name | domain | industry | locality | country | size_range |
---|---|---|---|---|---|---|
111 | "cjsc ""transport | customs | tourism""" | ttt-w.ru | package/freight delivery | vyborg, leningrad, russia |
when I read it with PySpark.
It seems the cell "cjsc ""transport, customs, tourism""" is split into 3 cells: |"cjsc ""transport| customs| tourism"""|.
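For what it's worth, the broken output matches what a plain comma split (one that ignores quoting) would produce on that field, while a quote-aware parse keeps the field whole. A small sketch, using the cell value copied from the row above:

```python
import csv
import io

# The raw field as it appears in the file (quotes doubled inside a quoted field)
cell = '"cjsc ""transport, customs, tourism"""'

# Naive split on every comma reproduces the 3-cell breakup shown above
print(cell.split(","))
# ['"cjsc ""transport', ' customs', ' tourism"""']

# A quote-aware CSV parse keeps the field intact
print(next(csv.reader(io.StringIO(cell))))
# ['cjsc "transport, customs, tourism"']
```

This suggests the reader is splitting on commas without honoring the quotes, rather than the delimiter itself being wrong.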
How can I set the delimiter to be exactly "," with no following whitespace?
UPDATE:
I checked the CSV file, the original line is:
111,"cjsc ""transport, customs, tourism""",ttt-w.ru,package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10
So is it still a delimiter problem, or is it a problem with the quotes?
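One quick check: Python's csv module follows the doubled-quote convention (RFC 4180) that this line uses, and it parses the line into the expected 7 fields. A minimal sketch, using the original line quoted above:

```python
import csv
import io

# The original line from the CSV file
line = ('111,"cjsc ""transport, customs, tourism""",ttt-w.ru,'
        'package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10')

fields = next(csv.reader(io.StringIO(line)))
print(len(fields))  # 7
print(fields[1])    # cjsc "transport, customs, tourism"
print(fields[4])    # vyborg, leningrad, russia
```

Since a quote-aware parser handles the line cleanly, the file's quoting looks valid, which points at the quote handling of the reader rather than the delimiter; one thing worth experimenting with (an assumption on my part, not verified here) is Spark's quote/escape options, since its CSV reader defaults to backslash as the escape character rather than the doubled-quote convention.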