I have multiple PySpark dataframes that already exist, and I need to add a new column to each one. If I "hardcode" the solution, it works. However, when I try to use a for loop to add the column to all of the dataframes, I get an error saying that "withColumn" is not an attribute of the dataframe, and I don't understand why.
What I've typed below is what I've tried. I'm a Python and PySpark neophyte.
# import the Row class from pyspark.sql
from pyspark.sql import *
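(Side note: I'm running this where spark already exists, like a notebook or the pyspark shell. If you are reproducing this in a standalone script, I believe you would first need something like this sketch, where the app name is just a placeholder I made up:)

from pyspark.sql import Row, SparkSession

# build or reuse a SparkSession; "bmi_example" is an arbitrary app name
spark = SparkSession.builder.appName("bmi_example").getOrCreate()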
CREATE EXAMPLE DATA
# ----> Create some data for the first dataframe
Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Sam', gender='M', height = '77',age = '42',weight = '190')
input2 = Row(firstname='Diane', gender='F', height = '70',age = '21', weight ='110')
input3 = Row(firstname='Norm',gender='M', height = '68',age = '33',weight = '240')
input4 = Row(firstname='Carla', gender='F', height = '60',age = '29',weight = '90')
# ----> Create the spark dataframe for the first input
bios1 = Row(input1, input2, input3, input4)
bios1_sdf = spark.createDataFrame(bios1)
# ----> Create some more data
Labels = Row("firstname", "gender", "height", "age", "weight")
input1 = Row(firstname='Chandler', gender='M', height='72', age='109', weight='270')
input2 = Row(firstname='Monica', gender='F', height='64', age='10', weight='123')
input3 = Row(firstname='Ross', gender='M', height='74', age='59', weight='168')
input4 = Row(firstname='Phoebe', gender='F', height='64', age='2', weight='20')
# ----> Create the spark dataframe for the second input
bios2 = Row(input1, input2, input3, input4)
bios2_sdf = spark.createDataFrame(bios2)
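(Aside: wrapping the four inputs in an outer Row seems to work because Row behaves like a tuple, but as far as I can tell a plain Python list does the same job:)

# equivalent construction: createDataFrame also accepts a list of Rows
bios2_sdf = spark.createDataFrame([input1, input2, input3, input4])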
HARD CODED METHOD
# ----> add bmi to dataframe one
bios1_sdf = bios1_sdf.withColumn('bmi', bios1_sdf.weight / (bios1_sdf.height * bios1_sdf.height) * 703)
print("data in bios1_sdf")
bios1_sdf.show()
# ----> add bmi to dataframe two
bios2_sdf = bios2_sdf.withColumn('bmi', bios2_sdf.weight / (bios2_sdf.height * bios2_sdf.height) * 703)
print("data in bios2_sdf")
bios2_sdf.show()
------> data in bios1_sdf
+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight| bmi|
+---+---------+------+------+------+------------------+
| 42| Sam| M| 77| 190| 22.52825096980941|
| 21| Diane| F| 70| 110|15.781632653061223|
| 33| Norm| M| 68| 240| 36.4878892733564|
| 29| Carla| F| 60| 90| 17.575|
+---+---------+------+------+------+------------------+
------> data in bios2_sdf
+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight| bmi|
+---+---------+------+------+------+------------------+
|109| Chandler| M| 72| 270|36.614583333333336|
| 10| Monica| F| 64| 123| 21.110595703125|
| 59| Ross| M| 74| 168| 21.56756756756757|
| 2| Phoebe| F| 64| 20| 3.4326171875|
+---+---------+------+------+------+------------------+
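(Note: height and weight were created as strings, so I believe the arithmetic above only works because Spark implicitly casts them to doubles during the division. If I wanted the casts to be explicit, my understanding is that a sketch like this produces the same bmi column:)

# same bmi calculation, with explicit string-to-double casts
h = bios1_sdf.height.cast('double')
w = bios1_sdf.weight.cast('double')
bios1_sdf = bios1_sdf.withColumn('bmi', w / (h * h) * 703)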
DYNAMIC CODED METHOD
# ----> initialize dictionary
dict_of_df = {}
# ----> list of pyspark dataframes to add a column to
list_of_sdf = [bios1_sdf, bios2_sdf]
for i in range(1, 2+1):
    # name of dataframe
    key_name_in = 'bios' + str(i) + '_sdf'
    dict_of_df[key_name_in] = list_of_sdf[i-1]
    temp_sdf = dict_of_df[key_name_in]
    # add bmi
    dict_of_df[key_name_in] = temp_sdf.withcolumn('bmi', sdf.weight/(temp_sdf.height*temp_sdf.height)*703)
ERROR
I'm expecting to get these results, the same results as my "hardcoded" example:
-----> data in bios1_sdf
+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight| bmi|
+---+---------+------+------+------+------------------+
| 42| Sam| M| 77| 190| 22.52825096980941|
| 21| Diane| F| 70| 110|15.781632653061223|
| 33| Norm| M| 68| 240| 36.4878892733564|
| 29| Carla| F| 60| 90| 17.575|
+---+---------+------+------+------+------------------+
-----> data in bios2_sdf
+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight| bmi|
+---+---------+------+------+------+------------------+
|109| Chandler| M| 72| 270|36.614583333333336|
| 10| Monica| F| 64| 123| 21.110595703125|
| 59| Ross| M| 74| 168| 21.56756756756757|
| 2| Phoebe| F| 64| 20| 3.4326171875|
+---+---------+------+------+------+------------------+
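Update: comparing the loop to the hardcoded lines character by character, I notice two typos: withcolumn should be withColumn (attribute names are case-sensitive in Python, which would explain why the dataframe has no attribute named withcolumn) and sdf.weight should be temp_sdf.weight. Below is a sketch of the loop with just those two fixes applied (and iterating over the list directly, which I think is equivalent); I haven't verified there isn't a further problem:

# ----> same loop with withColumn capitalized and temp_sdf used consistently
dict_of_df = {}
for i, temp_sdf in enumerate([bios1_sdf, bios2_sdf], start=1):
    key_name_in = 'bios' + str(i) + '_sdf'
    # add bmi; note the capital C in withColumn
    dict_of_df[key_name_in] = temp_sdf.withColumn(
        'bmi', temp_sdf.weight / (temp_sdf.height * temp_sdf.height) * 703)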