2

I have multiple pyspark dataframes that already exist. I need to add a new column to each dataframe. I can "hardcode" the solution and it works. However, when I try to use a for loop to add the column to all dataframes, I get an error. The error indicates that "withColumn" is not attribute of the dataframe. I don't understand this error.

What I've typed below is what I've tried. I'm a python and pyspark neophyte.

import pyspark class Row from module sql

from pyspark.sql import *

CREATE EXAMPLE DATA

----> Create some data for first dataframe

Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Sam', gender='M', height = '77',age = '42',weight = '190')
input2 = Row(firstname='Diane', gender='F', height =  '70',age = '21', weight ='110')
input3 = Row(firstname='Norm',gender='M', height =  '68',age = '33',weight = '240')
input4 = Row(firstname='Carla', gender='F', height =  '60',age = '29',weight = '90')

----> Create the spark dataframe for first input

bios1 = Row(input1, input2, input3, input4)
bios1_sdf = spark.createDataFrame(bios1)

----> Create some more data

Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Chandler', gender='M', height = '72',age = '109',weight = '270')
input2 = Row(firstname='Monica', gender='F', height =  '64',age = '10', weight ='123')
input3 = Row(firstname='Ross',gender='M', height =  '74',age = '59',weight = '168')
input4 = Row(firstname='Phoebe', gender='F', height =  '64',age = '2',weight = '20')

----> Create the spark dataframe for second input

bios2 = Row(input1, input2, input3, input4)
bios2_sdf = spark.createDataFrame(bios2)

HARD CODED METHOD

----> add bmi t dataframe one

bios1_sdf = bios1_sdf.withColumn('bmi',(bios1_sdf.weight/(bios1_sdf.height * bios1_sdf.height)*703))
print("data in bios1_sdf")
bios1_sdf.show()

----> add bmi to dataframe two

bios2_sdf = bios2_sdf.withColumn('bmi',(bios2_sdf.weight/(bios2_sdf.height * bios2_sdf.height)*703))
print("data in bios2_sdf")
bios2_sdf.show()

------> data in bios1_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
| 42|      Sam|     M|    77|   190| 22.52825096980941|
| 21|    Diane|     F|    70|   110|15.781632653061223|
| 33|     Norm|     M|    68|   240|  36.4878892733564|
| 29|    Carla|     F|    60|    90|            17.575|
+---+---------+------+------+------+------------------+

------> data in bios2_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
|109| Chandler|     M|    72|   270|36.614583333333336|
| 10|   Monica|     F|    64|   123|   21.110595703125|
| 59|     Ross|     M|    74|   168| 21.56756756756757|
|  2|   Phoebe|     F|    64|    20|      3.4326171875|
+---+---------+------+------+------+------------------+

DYNAMIC CODED METHOD

----> initialize dictionary

dict_of_df ={}

----> list of pyspark dataframes to add a column

list_of_sdf = [bios1_sdf, bios2_sdf]



for i in range(1,2+1):


    # name of dataframe
    key_name_in = 'bios'+str(i)+'_sdf'
    dict_of_df[key_name_in] = list_of_sdf[i-1]
    temp_sdf = dict_of_df[key_name_in]

    # add bmi
    dict_of_df[key_name_in] = temp_sdf.withcolumn('bmi',sdf.weight/(temp_sdf.height*temp_sdf.height)*703)

ERROR

I'm expecting to get these results. The same results as my "hardcoded" example.

-----> data in bios1_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
| 42|      Sam|     M|    77|   190| 22.52825096980941|
| 21|    Diane|     F|    70|   110|15.781632653061223|
| 33|     Norm|     M|    68|   240|  36.4878892733564|
| 29|    Carla|     F|    60|    90|            17.575|
+---+---------+------+------+------+------------------+

-----> data in bios2_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
|109| Chandler|     M|    72|   270|36.614583333333336|
| 10|   Monica|     F|    64|   123|   21.110595703125|
| 59|     Ross|     M|    74|   168| 21.56756756756757|
|  2|   Phoebe|     F|    64|    20|      3.4326171875|
+---+---------+------+------+------+------------------+
Scratch
  • 21
  • 3

0 Answers0