
Let me take a simple example to explain what I am trying to do. Let us say we have two very simple DataFrames, as below:

Df1
+---+---+---+
| a1| a2| a3|
+---+---+---+
|  2|  3|  7|
|  1|  9|  6|
+---+---+---+

Df2
+---+---+
| b1| b2|
+---+---+
| 10|  2|
|  9|  3|
+---+---+

From df1 and df2, we need to create a new DataFrame whose columns are the Cartesian product of the original columns of df1 and df2. Specifically, the new DataFrame will have columns 'a1b1', 'a1b2', 'a2b1', 'a2b2', 'a3b1', 'a3b2', and each row will hold the products of the corresponding columns from df1 and df2. The result should look like the following:

Df3
+----+----+----+----+----+----+
|a1b1|a1b2|a2b1|a2b2|a3b1|a3b2|
+----+----+----+----+----+----+
|  20|   4|  30|   6|  70|  14|
|   9|   3|  81|  27|  54|  18|
+----+----+----+----+----+----+
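As a sanity check, the expected values in Df3 can be reproduced with plain Python; this just multiplies every a-column value by every b-column value, row by row:

```python
# Example data from Df1 and Df2, as parallel row lists
df1_rows = [[2, 3, 7], [1, 9, 6]]   # columns a1, a2, a3
df2_rows = [[10, 2], [9, 3]]        # columns b1, b2

# For each linked pair of rows, take every a-value times every b-value
df3_rows = [[a * b for a in r1 for b in r2]
            for r1, r2 in zip(df1_rows, df2_rows)]
print(df3_rows)  # [[20, 4, 30, 6, 70, 14], [9, 3, 81, 27, 54, 18]]
```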

I have searched the Spark online docs as well as questions posted here, but it seems they are all about the Cartesian product of rows, not columns. For example, rdd.cartesian() produces the Cartesian product of combinations of row values, as in the following code:

r = sc.parallelize([1, 2])
r.cartesian(r).toDF().show()

+---+---+
| _1| _2|
+---+---+
|  1|  1|
|  1|  2|
|  2|  1|
|  2|  2|
+---+---+

But this is not what I need. Again, I need to create new columns instead of new rows; the number of rows will remain the same in my problem. I understand a UDF could eventually solve the problem. However, in my real application we have a huge dataset, and creating all the columns (about 500 new columns, all possible combinations of columns) takes too long. We would prefer some sort of vectorized operation that might be more efficient. I may be wrong, but Spark UDFs seem to be based on row operations, which may be why they take so long to finish.

Thanks a lot for any suggestions/feedback/comments.

For your convenience, here is the simple code to create the example DataFrames shown above:

df1 = sqlContext.createDataFrame([[2,3,7],[1,9,6]],['a1','a2','a3'])
df1.show()

df2 = sqlContext.createDataFrame([[10,2],[9,3]],['b1','b2'])
df2.show()
spectrum
  • How do you link rows? Order is not something you can depend on in general. – zero323 Feb 17 '17 at 22:33
  • Hi Zero323, thanks for your message. We have primary key to link rows. Here let us simply assume that rows are matching by integer indices and all dataframes have same number of rows. – spectrum Feb 17 '17 at 23:08
  • OK, so pro tip: having explicit key is good. Depending on indices is not :) In general `df1.join(df2, ['id']).select([df1[x] * df2[y] for x in df1.columns for y in df2.columns if x != 'id' and y != 'id'])` when `id` is a linking column. – zero323 Feb 18 '17 at 11:51
  • Hi zero323, your pro code works well, man :) However, the column names generated are not what I want. I can easily rename them, though. Thanks a lot! – spectrum Feb 21 '17 at 20:57
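The join-based approach zero323 suggests in the comments can be extended with explicit aliases to get the 'a1b1'-style column names. A minimal sketch, assuming both DataFrames carry an explicit `id` key linking their rows (the Spark call is shown commented; the name construction is plain Python):

```python
# Column names in each DataFrame (the linking key 'id' is excluded)
n1 = ['a1', 'a2', 'a3']
n2 = ['b1', 'b2']

# The 'a1b1'-style names for the product columns
names = [x + y for x in n1 for y in n2]
print(names)  # ['a1b1', 'a1b2', 'a2b1', 'a2b2', 'a3b1', 'a3b2']

# The corresponding Spark call would be (sketch, assuming an 'id' key):
# df3 = df1.join(df2, ['id']).select(
#     [(df1[x] * df2[y]).alias(x + y) for x in n1 for y in n2])
```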

1 Answer


It's not straightforward, as far as I know. Here is a shot at it using eval:

# function to add row numbers to a dataframe
def addrownum(df):
    dff = df.rdd.zipWithIndex().toDF(['features', 'rownum'])
    odf = dff.rdd.map(lambda x: tuple(x.features) + (x.rownum,)).toDF(df.columns + ['rownum'])
    return odf

df1_ = addrownum(df1)
df2_ = addrownum(df2)
# Join based on rownumbers
outputdf = df1_.join(df2_, df1_.rownum == df2_.rownum).drop(df1_.rownum).drop(df2_.rownum)

n1 = ['a1','a2','a3']  # columns in set1
n2 = ['b1','b2']       # columns in set2

# I create a string of expression that I want to execute
eval_list = ['x.'+l1+'*'+'x.'+l2 for l1 in n1 for l2 in n2]
eval_str = '('+','.join(eval_list)+')'
col_list = [l1+l2 for l1 in n1 for l2 in n2] 

dfcartesian = outputdf.rdd.map(lambda x: eval(eval_str)).toDF(col_list)
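To see what the eval does for a single row, here is a plain-Python check, with a namedtuple standing in for a joined Row (the namedtuple is only an illustrative stand-in, not part of the Spark code above):

```python
from collections import namedtuple

# Stand-in for one joined row of df1_ and df2_ (first rows of the example)
JoinedRow = namedtuple('JoinedRow', ['a1', 'a2', 'a3', 'b1', 'b2'])
x = JoinedRow(2, 3, 7, 10, 2)

eval_str = '(x.a1*x.b1,x.a1*x.b2,x.a2*x.b1,x.a2*x.b2,x.a3*x.b1,x.a3*x.b2)'
print(eval(eval_str))  # (20, 4, 30, 6, 70, 14)
```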

Something else that might be of help to you is ElementwiseProduct in spark.ml.feature, but it will be no less complex: it multiplies each input feature vector element-wise by a provided weight vector, and you would still need to expand the resulting feature vectors back into DataFrame columns.
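For intuition, an elementwise product multiplies paired entries of two equal-length vectors, which is why a single fixed weight vector does not by itself produce all the column combinations. A plain-Python sketch of the operation:

```python
def elementwise_product(v, w):
    """Multiply two equal-length vectors entry by entry."""
    return [a * b for a, b in zip(v, w)]

print(elementwise_product([2, 3, 7], [10, 10, 10]))  # [20, 30, 70]
```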

Gaurav Dhama
    Hi, thanks for replying. Again, the method you are using is a row operation, which is very slow for a huge data set. Also, elementwise product in MLlib does not work because it uses a separate weight vector to multiply the array cells in a row. – spectrum Feb 21 '17 at 21:01