handle dynamic number of columns (csv file) in pyspark

Question

I am getting the CSV file below (without the header) -

D,neel,32,1,pin1,state1,male
D,sani,31,2,pin1,state1,pin2,state2,female
D,raja,33,3,pin1,state1,pin2,state2,pin3,state3,male

I want to create the CSV file below using pyspark dataframe -

D,neel,32,1,pin1,state1,male
D,sani,31,2,pin1,state1,female
D,sani,31,2,pin2,state2,female
D,raja,33,3,pin1,state1,male
D,raja,33,3,pin2,state2,male
D,raja,33,3,pin3,state3,male

note: the number in 4th column in the input file determines how many pin and state columns are in the record. like

as neel has 1 in the 4th column, thus neel has 1 set of pin and state  (pin1,state1)
as sani has 2 in 4th column, thus sani has 2 sets of pin and state  (pin1,state1,pin2,state2
as raja has 3 in 4th column, thus raja has 3 sets of pin and state  (pin1,state1,pin2,state2,pin3,state3)

I'm not able to achieve my desired output..

Welcome to SO! What exactly have you tried so far? We're much more here to help with specific questions of the form "I tried X, but it did not do what I expect and instead resulted in an error!" accompanied by a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) — ti7, Aug 08 '23 at 19:57

parisni · Answer 1 · 2023-08-09T07:33:33.587

There is no CSV data source option to handle such dynamic case.

One way is to read it as a CSV and specifying a schema with largest known number of column.

Then you apply transformations based on 4th column by using

case 
when col4 = 1 then col7
When col4 = 2 then col8
...
End as gender

You can apply this logic for every dynamic columns, and also generate it by script if the number in col4 is high.

One other way is to read the file as a rdd of string and parse the content with custom code.

score 0 · Answer 2 · answered Aug 09 '23 at 15:50

Since the text file is unstructured we can use a custom function and apply it to an RDD to parse the text file.

def explode(row):
    a, b, c, d, *e, f = row.split(',')
    return [[a, b, c, d, *e[i: i+2], f] 
            for i in range(0, len(e), 2)]

df = spark.sparkContext.textFile('data.csv').flatMap(explode).toDF()

df.show()

+---+----+---+---+----+------+------+
| _1|  _2| _3| _4|  _5|    _6|    _7|
+---+----+---+---+----+------+------+
|  D|neel| 32|  1|pin1|state1|  male|
|  D|sani| 31|  2|pin1|state1|female|
|  D|sani| 31|  2|pin2|state2|female|
|  D|raja| 33|  3|pin1|state1|  male|
|  D|raja| 33|  3|pin2|state2|  male|
|  D|raja| 33|  3|pin3|state3|  male|
+---+----+---+---+----+------+------+

handle dynamic number of columns (csv file) in pyspark

2 Answers2