Convert dictionary of columns to Dataframe in from different dataframes : pyspark

Question

I am trying to combine columns from different dataframes into one for analysis. I am collecting all the columns I need into a dictionary.

I now have a dictionary like this -

newDFDict = {
    'schoolName': school.INSTNM,
    'type': school.CONTROL,
    'avgCostAcademicYear': costs.COSTT4_A, 
    'avgCostProgramYear': costs.COSTT4_P, 
    'averageNetPricePublic': costs.NPT4_PUB, 
}

{
 'schoolName': Column<b'INSTNM'>,
 'type': Column<b'CONTROL'>,
 'avgCostAcademicYear': Column<b'COSTT4_A'>,
 'avgCostProgramYear': Column<b'COSTT4_P'>,
 'averageNetPricePublic': Column<b'NPT4_PUB'>
}

I want to convert this dictionary to a Pyspark dataframe.

I have tried this method but the output is not what I was expecting -

newDFDict = {
    'schoolName': school.select("INSTNM").collect(),
    'type': school.select("CONTROL").collect(),
    'avgCostAcademicYear': costs.select("COSTT4_A").collect(), 
    'avgCostProgramYear': costs.select("COSTT4_P").collect(), 
    'averageNetPricePublic': costs.select("NPT4_PUB").collect(), 
}

newDF = sc.parallelize([newDFDict]).toDF()
newDF.show()
+---------------------+--------------------+--------------------+--------------------+--------------------+
|averageNetPricePublic| avgCostAcademicYear|  avgCostProgramYear|          schoolName|                type|
+---------------------+--------------------+--------------------+--------------------+--------------------+
| [[NULL], [NULL], ...|[[NULL], [NULL], ...|[[NULL], [NULL], ...|[[Community Colle...|[[1], [1], [1], [...|
+---------------------+--------------------+--------------------+--------------------+--------------------+

Is it even possible? If possible, how?

Is this the right way to do this? If not, how can I achieve this?

Using pandas is not an option as data is pretty big (2-3 GB) and pandas is just too slow. I am running pyspark on my local machine.

Thanks in advance! :)

you might have joined 2 dataframes using common key and then .... you can select columns that are needed for your so called dictionary right? there is no common column for both dataframes ? collect is killer with OOM if data is more. — Ram Ghadiyaram, May 13 '20 at 17:57
@RamGhadiyaram there are 10 tables with a combined total of over 1900 columns. I was thinking instead of joining 10 tables and selecting maybe 40-50 columns, we can do it via dictionary.. — raghhuveer-jaikanth, May 13 '20 at 18:08
what is wrong in ***join*** or ***union with all common columns*** as a single view/dataframe ? big no to collect... you are pulling all the data in to driver hence OOM — Ram Ghadiyaram, May 13 '20 at 18:12
I am not understanding what you are trying to say. Also I should mention the only common key is row number. — raghhuveer-jaikanth, May 13 '20 at 18:15
see my example, if you dont have join key you could go for union as below answer which is sql representation of your common dict fields — Ram Ghadiyaram, May 13 '20 at 18:24

Ram Ghadiyaram · Accepted Answer · 2020-05-13T19:07:45.270

1

These are 2 options I'd suggest

Option1 (union case to build dictionary) :

You said, you have >=10 tables (which you want to build dictionary from ) which has common columns (such as for example 'schoolName','type' 'avgCostAcademicYear' ,'avgCostProgramYear' , 'averageNetPricePublic' are common columns ) then you can go for union or unionByName to form single consolidated. view of the data.

For example :

select 'schoolName','type' 'avgCostAcademicYear' ,'avgCostProgramYear' , 'averageNetPricePublic' from df1

 union  

select 'schoolName','type' 'avgCostAcademicYear' ,'avgCostProgramYear' , 'averageNetPricePublic' from df2
 ....
union
select 'schoolName','type' 'avgCostAcademicYear' ,'avgCostProgramYear' , 'averageNetPricePublic' from dfN

will give you consolidated view of your dictionary

Option 2: (in case you have common join columns only)

If you have some common join columns you can also go for standard joins no matter how many tables are present ..

for psuedo sql example :

select dictionary columns from table1,table2,table3,... tablen where join common columns in all tables (table1... tablen)

note miss any join column will lead to cartesian product

edited May 13 '20 at 19:07

answered May 13 '20 at 18:23

Ram Ghadiyaram

28,239
13
95
121

The only common column is row number. There are no common columns – raghhuveer-jaikanth May 13 '20 at 18:27
may not be join you mean union wont work ? have you tried ? – Ram Ghadiyaram May 13 '20 at 18:29
Instead of creating dictionary after selecting columns from every table I am now joining after creating a new dataframe with the reqd columns using monotically_increasing_id(). So option 2 – raghhuveer-jaikanth May 13 '20 at 18:48
your monotonically increasing id is not incremental as sql row number be careful see my answer [here](https://stackoverflow.com/a/52546904/647053) – Ram Ghadiyaram May 13 '20 at 18:49
I plan to drop it anyway. It will not cause issues in join though right? (I am asking about mismatched joins) – raghhuveer-jaikanth May 13 '20 at 18:54
check it you will understand I cant visualize from here – Ram Ghadiyaram May 13 '20 at 18:56
Can you modify your answer to join after selecting each column. I want to mark it as correct xD – raghhuveer-jaikanth May 13 '20 at 19:03

Convert dictionary of columns to Dataframe in from different dataframes : pyspark

1 Answers1

Option1 (union case to build dictionary) :

Option 2: (in case you have common join columns only)