I have a comma-separated file with no header, where each line has a different number of items, such as:
a, x1, x2
b, x3, x4, x5
c, x6, x7, x8, x9
The first line contains only 3 items and subsequent lines contain more, so it seems the number of columns is inferred from the first line only: everything after the 3rd comma on the other lines is skipped and that data is lost.
spark = init_spark()
df = spark.read.csv(filename)
print(df.take(3))
I get:
[Row(_c0='a', _c1=' x1', _c2=' x2'),
Row(_c0='b', _c1=' x3', _c2=' x4'),
Row(_c0='c', _c1=' x6', _c2=' x7')]
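One thing I'm considering is passing an explicit schema that is wide enough for the longest line, so Spark doesn't infer the width from the first row. This is only a sketch (the DDL-string form of `schema` and the column count of 5 are assumptions based on the sample above):

```python
def wide_string_schema(n):
    """Build a DDL schema string of n string columns, _c0 .. _c{n-1}."""
    return ", ".join(f"_c{i} string" for i in range(n))

# Hypothetical usage with Spark (untested):
# df = spark.read.csv(filename, schema=wide_string_schema(5))
```

Short rows would then presumably come back with `None` in the trailing columns.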
Setting mode="PERMISSIVE" in pyspark.sql.readwriter doesn't solve the problem, maybe because there is no header.
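As a fallback I could read each line as raw text and split it myself, padding short rows to a fixed width. A minimal sketch of the padding logic (the width of 5 and the column names are assumptions from my sample file):

```python
def split_and_pad(line, max_cols):
    """Split a CSV line on commas, strip whitespace, pad to max_cols with None."""
    parts = [p.strip() for p in line.split(",")]
    return parts + [None] * (max_cols - len(parts))

# Hypothetical Spark usage (untested):
# lines = spark.read.text(filename)
# rows = lines.rdd.map(lambda r: split_and_pad(r.value, 5))
# df = spark.createDataFrame(rows, schema=["c0", "c1", "c2", "c3", "c4"])
```

But this feels like working around the reader rather than using it properly, so I'd prefer a built-in option if one exists.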