Let me use a simple example to explain what I am trying to do. Let us say we have two very simple DataFrames, as below:
df1
+---+---+---+
| a1| a2| a3|
+---+---+---+
| 2| 3| 7|
| 1| 9| 6|
+---+---+---+
df2
+---+---+
| b1| b2|
+---+---+
| 10| 2|
| 9| 3|
+---+---+
From df1 and df2, we need to create a new DataFrame whose columns are the Cartesian product of the original columns of df1 and df2. Specifically, the new DataFrame will have the columns 'a1b1', 'a1b2', 'a2b1', 'a2b2', 'a3b1', 'a3b2', and each value is the product of the corresponding values from df1 and df2 (for example, in the first row a1b1 = 2 * 10 = 20). The result should look like the following:
df3
+----+----+----+----+----+----+
|a1b1|a1b2|a2b1|a2b2|a3b1|a3b2|
+----+----+----+----+----+----+
| 20| 4| 30| 6| 70| 14|
| 9| 3| 81| 27| 54| 18|
+----+----+----+----+----+----+
I have searched the Spark online docs as well as questions posted here, but they all seem to be about the Cartesian product of rows, not columns. For example, rdd.cartesian() produces the Cartesian product of row value combinations, as in the following code:
r = sc.parallelize([1, 2])
r.cartesian(r).toDF().show()
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 1| 2|
| 2| 1|
| 2| 2|
+---+---+
But this is not what I need. Again, I need to create new columns rather than new rows; the number of rows stays the same in my problem. I understand that a UDF could eventually solve the problem. However, in my real application we have a huge dataset, and creating all the columns (about 500 new columns, one for each possible column combination) takes too long. We would prefer some sort of vectorized operation that could improve efficiency. I may be wrong, but Spark UDFs seem to operate row by row, which may be why they take so long to finish.
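To make the column-wise construction concrete, here is a rough sketch of the kind of thing I mean (not the efficient, vectorized approach I am hoping for). It assumes the rows of df1 and df2 can be aligned by a positional index, and with_index is just an illustrative helper name:

from pyspark.sql import Row

def with_index(df):
    # attach a positional row index via the underlying RDD's zipWithIndex
    return df.rdd.zipWithIndex().map(
        lambda pair: Row(idx=pair[1], **pair[0].asDict())
    ).toDF()

joined = with_index(df1).join(with_index(df2), on='idx')

# one product column per (a, b) pair, named like 'a1b1'
df3 = joined.select(
    *[(joined[a] * joined[b]).alias(a + b)
      for a in df1.columns
      for b in df2.columns]
)
df3.show()

Even if something along these lines works on the toy example, building roughly 500 such column expressions over a huge dataset is exactly the part I am worried about, hence the question.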
Thanks a lot for any suggestions/feedback/comments.
For your convenience, here is the simple code to create the example DataFrames shown above:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()   # reuses the shell's context if one already exists
sqlContext = SQLContext(sc)

df1 = sqlContext.createDataFrame([[2, 3, 7], [1, 9, 6]], ['a1', 'a2', 'a3'])
df1.show()
df2 = sqlContext.createDataFrame([[10, 2], [9, 3]], ['b1', 'b2'])
df2.show()