
I have a comma-separated file with no header, where each line has a different number of items, such as:

a, x1, x2  
b, x3, x4, x5  
c, x6, x7, x8, x9  

The first line contains only 3 items, while subsequent lines contain more. It seems the number of columns is inferred from the first line only, so everything after the third comma on the other lines is skipped and that data is lost.

spark = init_spark()
df = spark.read.csv(filename)
print(df.take(3))

I get:

[Row(_c0='a', _c1=' x1', _c2=' x2'),  
Row(_c0='b', _c1=' x3', _c2=' x4'),   
Row(_c0='c', _c1=' x6', _c2=' x7')]  

Setting mode="PERMISSIVE" (from pyspark.sql.readwriter) doesn't solve the problem, maybe because there is no header.

Samer Ayoub
  • You could read it in as a single column and then use map to expand. – cs95 Feb 26 '19 at 02:45
  • Possible duplicate of [Importing text file with varying number of columns in Spark](https://stackoverflow.com/questions/50158696/importing-text-file-with-varying-number-of-columns-in-spark) – pault Feb 26 '19 at 16:29
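
A minimal sketch of cs95's suggestion above (read each line as a single string column, then split it into fields). The width of 5 columns is an assumption for illustration; getItem() returns null for rows that have fewer fields:

from pyspark.sql.functions import split, col

# Read each line whole into a single string column named "value".
raw = spark.read.text(filename)

# Split on the comma (with optional surrounding whitespace);
# getItem(i) yields null when a row has fewer than i+1 fields.
parts = split(col("value"), r",\s*")
df = raw.select(*[parts.getItem(i).alias(f"_c{i}") for i in range(5)])
df.show()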

1 Answer


Assuming the maximum number of columns (comma-separated values) is known, and given the file a.csv:

col_a,col_b,col_c,col_d,col_e
1,2,3,4,5
1,2,3,e
1,a,b

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("col_a", StringType(), True),
    StructField("col_b", StringType(), True),
    StructField("col_c", StringType(), True),
    StructField("col_d", StringType(), True),
    StructField("col_e", StringType(), True)
])

df = spark.read.csv("a.csv", header=True, schema=schema)

df.show()

which results

+-----+-----+-----+-----+-----+
|col_a|col_b|col_c|col_d|col_e|
+-----+-----+-----+-----+-----+
|    1|    2|    3|    4|    5|
|    1|    2|    3|    e| null|
|    1|    a|    b| null| null|
+-----+-----+-----+-----+-----+
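
If the maximum width is not known in advance, one option is a first pass over the raw lines to measure it, then building the schema from that count. This is a sketch under that assumption; the generated col_i names are invented for illustration:

from pyspark.sql.types import StructType, StructField, StringType

# First pass: find the widest line in the raw text.
lines = spark.read.text("a.csv")
max_cols = lines.rdd.map(lambda r: len(r.value.split(","))).max()

# Build a schema of that width; shorter rows are padded with null.
schema = StructType([StructField(f"col_{i}", StringType(), True)
                     for i in range(max_cols)])
df = spark.read.csv("a.csv", header=True, schema=schema)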
Ranga Vure