
I have a Koalas DataFrame in PySpark. I want to calculate the standard deviation across a set of columns for each row. I have tried doing:

df2['x_std'] = df2[['x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6',
                    'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12']].std(axis=1)

I get the following error:

TypeError: 'DataFrame' object does not support item assignment

I also tried something like:

d1 = df2[['x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6',
          'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12']].std(axis=1)

df2['x_std'] = d1  # d1 is a Koalas Series that should become the new column

I get this error while doing so:

Cannot combine column argument because it comes from a different dataframe

I'm totally new to Koalas. Can anyone give me some ideas? Thanks.

K. K.

1 Answer


You can set the option "compute.ops_on_diff_frames" to True and then perform the operation.

import databricks.koalas as ks

ks.set_option("compute.ops_on_diff_frames", True)

kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [2, 1, 7, 4, 2, 3],
     'c': [3, 7, 1, 4, 6, 5],
     'd': [4, 2, 3, 4, 3, 8],},)

kdf['dev'] = kdf[['a', 'b', 'c', 'd']].std(axis=1)
print(kdf)

   a  b  c  d       dev
0  1  2  3  4  1.241909
5  6  3  5  8  2.363684
1  2  1  7  2  2.348840
3  4  4  4  4  1.788854
2  3  7  1  3  2.223378
4  5  2  6  3  1.856200

I'm not sure it is good practice, though, since the operation is disallowed by default.
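For comparison, the same row-wise pattern works in plain pandas without any option, since everything lives in a single local frame; Koalas deliberately mirrors this API. A minimal pandas sketch (the column names are illustrative):

```python
import pandas as pd

pdf = pd.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [2, 1, 7, 4, 2, 3],
     'c': [3, 7, 1, 4, 6, 5],
     'd': [4, 2, 3, 4, 3, 8]})

# Sample standard deviation (ddof=1) computed across each row
pdf['dev'] = pdf[['a', 'b', 'c', 'd']].std(axis=1)
```

If you would rather not flip the flag globally, Koalas also exposes a pandas-style `ks.option_context("compute.ops_on_diff_frames", True)` context manager, so the setting only applies inside a `with` block.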

Ben.T