1

I am trying to combine columns from different dataframes into one for analysis. I am collecting all the columns I need into a dictionary.

I now have a dictionary like this -

newDFDict = {
    'schoolName': school.INSTNM,
    'type': school.CONTROL,
    'avgCostAcademicYear': costs.COSTT4_A, 
    'avgCostProgramYear': costs.COSTT4_P, 
    'averageNetPricePublic': costs.NPT4_PUB, 
}

{
 'schoolName': Column<b'INSTNM'>,
 'type': Column<b'CONTROL'>,
 'avgCostAcademicYear': Column<b'COSTT4_A'>,
 'avgCostProgramYear': Column<b'COSTT4_P'>,
 'averageNetPricePublic': Column<b'NPT4_PUB'>
}

I want to convert this dictionary to a Pyspark dataframe.

I have tried this method but the output is not what I was expecting -

newDFDict = {
    'schoolName': school.select("INSTNM").collect(),
    'type': school.select("CONTROL").collect(),
    'avgCostAcademicYear': costs.select("COSTT4_A").collect(), 
    'avgCostProgramYear': costs.select("COSTT4_P").collect(), 
    'averageNetPricePublic': costs.select("NPT4_PUB").collect(), 
}

newDF = sc.parallelize([newDFDict]).toDF()
newDF.show()
+---------------------+--------------------+--------------------+--------------------+--------------------+
|averageNetPricePublic| avgCostAcademicYear|  avgCostProgramYear|          schoolName|                type|
+---------------------+--------------------+--------------------+--------------------+--------------------+
| [[NULL], [NULL], ...|[[NULL], [NULL], ...|[[NULL], [NULL], ...|[[Community Colle...|[[1], [1], [1], [...|
+---------------------+--------------------+--------------------+--------------------+--------------------+

Is it even possible? If possible, how?

Is this the right way to do this? If not, how can I achieve this?

Using pandas is not an option as data is pretty big (2-3 GB) and pandas is just too slow. I am running pyspark on my local machine.

Thanks in advance! :)

Ram Ghadiyaram
  • 28,239
  • 13
  • 95
  • 121
  • you might have joined 2 dataframes using common key and then .... you can select columns that are needed for your so called dictionary right? there is no common column for both dataframes ? collect is killer with OOM if data is more. – Ram Ghadiyaram May 13 '20 at 17:57
  • @RamGhadiyaram there are 10 tables with a combined total of over 1900 columns. I was thinking instead of joining 10 tables and selecting maybe 40-50 columns, we can do it via dictionary.. – raghhuveer-jaikanth May 13 '20 at 18:08
  • what is wrong in ***join*** or ***union with all common columns*** as a single view/dataframe ? big no to collect... you are pulling all the data in to driver hence OOM – Ram Ghadiyaram May 13 '20 at 18:12
  • I am not understanding what you are trying to say. Also I should mention the only common key is row number. – raghhuveer-jaikanth May 13 '20 at 18:15
  • see my example, if you dont have join key you could go for union as below answer which is sql representation of your common dict fields – Ram Ghadiyaram May 13 '20 at 18:24

1 Answers1

1

These are 2 options I'd suggest

Option1 (union case to build dictionary) :

You said, you have >=10 tables (which you want to build dictionary from ) which has common columns (such as for example 'schoolName','type' 'avgCostAcademicYear' ,'avgCostProgramYear' , 'averageNetPricePublic' are common columns ) then you can go for union or unionByName to form single consolidated. view of the data.

For example :

select 'schoolName','type' 'avgCostAcademicYear' ,'avgCostProgramYear' , 'averageNetPricePublic' from df1

 union  

select 'schoolName','type' 'avgCostAcademicYear' ,'avgCostProgramYear' , 'averageNetPricePublic' from df2
 ....
union
select 'schoolName','type' 'avgCostAcademicYear' ,'avgCostProgramYear' , 'averageNetPricePublic' from dfN 

will give you consolidated view of your dictionary

Option 2: (in case you have common join columns only)

If you have some common join columns you can also go for standard joins no matter how many tables are present ..

for psuedo sql example :

select dictionary columns from table1,table2,table3,... tablen where join common columns in all tables (table1... tablen)

note miss any join column will lead to cartesian product

Ram Ghadiyaram
  • 28,239
  • 13
  • 95
  • 121