I am trying to combine columns from different DataFrames into a single DataFrame for analysis, so I am collecting all the columns I need into a dictionary.
I now have a dictionary like this:
newDFDict = {
    'schoolName': school.INSTNM,
    'type': school.CONTROL,
    'avgCostAcademicYear': costs.COSTT4_A,
    'avgCostProgramYear': costs.COSTT4_P,
    'averageNetPricePublic': costs.NPT4_PUB,
}
Printing it shows that each value is a Column reference, not actual data:
{
    'schoolName': Column<b'INSTNM'>,
    'type': Column<b'CONTROL'>,
    'avgCostAcademicYear': Column<b'COSTT4_A'>,
    'avgCostProgramYear': Column<b'COSTT4_P'>,
    'averageNetPricePublic': Column<b'NPT4_PUB'>
}
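Since INSTNM and CONTROL both come from the school DataFrame, I can already select and rename those two together, roughly like the sketch below (schoolPart is just an illustrative name); the part I cannot figure out is pulling in the three cost columns, which live in the separate costs DataFrame.

from pyspark.sql.functions import col

# Columns from the same DataFrame can be selected and renamed in one go.
schoolPart = school.select(
    col("INSTNM").alias("schoolName"),
    col("CONTROL").alias("type"),
)
# COSTT4_A, COSTT4_P and NPT4_PUB come from the separate costs DataFrame,
# which is where I get stuck.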
I want to convert this dictionary into a single PySpark DataFrame.
I tried collecting each column and parallelizing the result, but the output is not what I expected:
newDFDict = {
    'schoolName': school.select("INSTNM").collect(),
    'type': school.select("CONTROL").collect(),
    'avgCostAcademicYear': costs.select("COSTT4_A").collect(),
    'avgCostProgramYear': costs.select("COSTT4_P").collect(),
    'averageNetPricePublic': costs.select("NPT4_PUB").collect(),
}
newDF = sc.parallelize([newDFDict]).toDF()
newDF.show()
+---------------------+--------------------+--------------------+--------------------+--------------------+
|averageNetPricePublic| avgCostAcademicYear| avgCostProgramYear| schoolName| type|
+---------------------+--------------------+--------------------+--------------------+--------------------+
| [[NULL], [NULL], ...|[[NULL], [NULL], ...|[[NULL], [NULL], ...|[[Community Colle...|[[1], [1], [1], [...|
+---------------------+--------------------+--------------------+--------------------+--------------------+
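Each column seems to have ended up as one big array of single-field Rows inside a single row, rather than as one row per school. What I am hoping for is a plain five-column DataFrame, roughly what the sketch below would produce; it assumes both datasets share a join key, and I am using UNITID purely as a placeholder name since I have not verified it.

from pyspark.sql.functions import col

# Pick and rename the needed columns from each source DataFrame,
# then join them on a shared key (UNITID is an assumed placeholder).
schoolCols = school.select(
    col("UNITID"),
    col("INSTNM").alias("schoolName"),
    col("CONTROL").alias("type"),
)
costCols = costs.select(
    col("UNITID"),
    col("COSTT4_A").alias("avgCostAcademicYear"),
    col("COSTT4_P").alias("avgCostProgramYear"),
    col("NPT4_PUB").alias("averageNetPricePublic"),
)
newDF = schoolCols.join(costCols, on="UNITID", how="inner")
newDF.show(5)

But I am not sure whether a join like this is the idiomatic approach, or whether it will hold up at this data size.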
Is it even possible to build a DataFrame directly from a dictionary of Column objects like my first attempt? If so, how?
And if neither of these is the right way, how should I combine columns from different DataFrames into one?
Using pandas is not an option, as the data is fairly big (2-3 GB) and pandas is too slow for it. I am running PySpark on my local machine.
Thanks in advance! :)