
I have a comma-separated file with no header, where each line has a different number of items, such as:

a, x1, x2  
b, x3, x4, x5  
c, x6, x7, x8, x9  

The first line contains only 3 items, while subsequent lines contain more. It seems the number of columns is inferred from the first line only, so everything after the third comma on the other lines is skipped and that data is lost.

spark = init_spark()
df = spark.read.csv(filename)
print(df.take(3))

I get:

[Row(_c0='a', _c1=' x1', _c2=' x2'),  
Row(_c0='b', _c1=' x3', _c2=' x4'),   
Row(_c0='c', _c1=' x6', _c2=' x7')]  

Setting mode="PERMISSIVE" (from pyspark.sql.readwriter) doesn't solve the problem, maybe because there is no header.

Samer Ayoub
  • You could read it in as a single column and then use map to expand. – cs95 Feb 26 '19 at 02:45
  • Possible duplicate of [Importing text file with varying number of columns in Spark](https://stackoverflow.com/questions/50158696/importing-text-file-with-varying-number-of-columns-in-spark) – pault Feb 26 '19 at 16:29
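
A minimal sketch of cs95's suggestion above (read each line as a single string column, then split it into fields). The width of 5 columns is an assumption for illustration; getItem() returns null for rows that have fewer fields:

from pyspark.sql.functions import split, col

# Read each line whole into a single string column named "value".
raw = spark.read.text(filename)

# Split on the comma (with optional surrounding whitespace);
# getItem(i) yields null when a row has fewer than i+1 fields.
parts = split(col("value"), r",\s*")
df = raw.select(*[parts.getItem(i).alias(f"_c{i}") for i in range(5)])
df.show()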

1 Answer


Assuming the maximum number of columns (comma-separated values) is known, and given the file a.csv:

col_a,col_b,col_c,col_d,col_e
1,2,3,4,5
1,2,3,e
1,a,b

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("col_a", StringType(), True),
    StructField("col_b", StringType(), True),
    StructField("col_c", StringType(), True),
    StructField("col_d", StringType(), True),
    StructField("col_e", StringType(), True)
])

df = spark.read.csv("a.csv", header=True, schema=schema)

df.show()

which results

+-----+-----+-----+-----+-----+
|col_a|col_b|col_c|col_d|col_e|
+-----+-----+-----+-----+-----+
|    1|    2|    3|    4|    5|
|    1|    2|    3|    e| null|
|    1|    a|    b| null| null|
+-----+-----+-----+-----+-----+
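
If the maximum width is not known in advance, one option is a first pass over the raw lines to measure it, then building the schema from that count. This is a sketch under that assumption; the generated col_i names are invented for illustration:

from pyspark.sql.types import StructType, StructField, StringType

# First pass: find the widest line in the raw text.
lines = spark.read.text("a.csv")
max_cols = lines.rdd.map(lambda r: len(r.value.split(","))).max()

# Build a schema of that width; shorter rows are padded with null.
schema = StructType([StructField(f"col_{i}", StringType(), True)
                     for i in range(max_cols)])
df = spark.read.csv("a.csv", header=True, schema=schema)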
Ranga Vure