
I am trying to add a new column to my existing Koalas DataFrame, but every value in the new column becomes NaN as soon as it is added. I am not sure what's going on here. Could anyone give me some pointers?

Here's the code:

import databricks.koalas as ks
import numpy as np

kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

ks.set_option('compute.ops_on_diff_frames', True)
ks_series = ks.Series((np.arange(len(kdf.to_numpy()))))
kdf["values"] = ks_series

ks.reset_option('compute.ops_on_diff_frames')
mck
ShellZero

1 Answer


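You need to match the index when adding a new column. Koalas aligns the assigned Series with the DataFrame by index label, and a Series created without an explicit index gets the default labels 0 to 5. None of those match the DataFrame's labels 10 to 60, so every row ends up as NaN. Passing the DataFrame's index to the Series fixes this: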

import databricks.koalas as ks
import numpy as np

kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

ks.set_option('compute.ops_on_diff_frames', True)
ks_series = ks.Series(np.arange(len(kdf.to_numpy())), index=kdf.index.tolist())
kdf["values"] = ks_series

kdf
    a    b      c  values
10  1  100    one       0
20  2  200    two       1
30  3  300  three       2
40  4  400   four       3
50  5  500   five       4
60  6  600    six       5
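
The alignment behaviour can be reproduced without a Spark session. The following is only a sketch that uses plain pandas as a stand-in, on the assumption that Koalas mirrors pandas' label-based alignment here; the names `pdf`, `mismatched` and `matched` are made up for illustration.

```python
import numpy as np
import pandas as pd

# pandas stand-in for the Koalas frame (assumption: same alignment rules).
pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]},
                   index=[10, 20, 30, 40, 50, 60])

# Default index 0..5 shares no label with 10..60, so nothing lines up.
mismatched = pd.Series(np.arange(len(pdf)))
with_nan = pdf.assign(values=mismatched)

# Reusing the frame's own index makes every label match.
matched = pd.Series(np.arange(len(pdf)), index=pdf.index)
aligned = pdf.assign(values=matched)

print(with_nan)
print(aligned)
```

In the first frame the `values` column is entirely NaN; in the second it holds 0 through 5 against labels 10 through 60.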
mck
  • Awesome :) thank you so much :) And of course I had to do `kdf = kdf.sort_index()` to get the rows back in order. That seems to be necessary because the order gets messed up when adding a new column. Are there any other good practices to keep in mind? I'm fairly new to the Spark world. @mck – ShellZero May 24 '21 at 15:37
  • If you want to use Spark, I'd suggest avoiding index columns, which are not native to Spark (only to pandas) and cause many issues in Koalas – mck May 24 '21 at 15:39
  • I see, I need to ponder that approach in my workflow. Without indexing, I think it would be tricky to insert the data into a relational db. – ShellZero May 24 '21 at 15:44
  • Just to clarify, you can have an index column, but it should be a separate, regular column, like a, b, c, etc. – mck May 24 '21 at 15:46
  • Got it. I'll have to look into that :) Thank you for clarifying it :) – ShellZero May 24 '21 at 15:48
  • This example worked because the index is not named in the original df that @mck created. For good measure, add `ks_series.index.name = kdf.index.name`. If the two indexes do not have the same name, this fails – figs_and_nuts Dec 08 '21 at 13:54
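
One way to read mck's suggestion about keeping the key as a separate column is to join on that column explicitly instead of relying on index alignment. The following is a rough sketch in plain pandas as a stand-in (an assumption, chosen so it runs without Spark); the column name `id` and the frame names are hypothetical. Koalas also provides a `merge` method, but this exact call has not been checked against it.

```python
import numpy as np
import pandas as pd

# The former index labels live in an ordinary column named 'id' (hypothetical).
left = pd.DataFrame({'id': [10, 20, 30, 40, 50, 60],
                     'a': [1, 2, 3, 4, 5, 6]})

# The new data carries the same key column rather than an index.
right = pd.DataFrame({'id': left['id'],
                      'values': np.arange(len(left))})

# Join explicitly on the key; no index alignment is involved.
merged = left.merge(right, on='id', how='left')
print(merged)
```

Because the key is ordinary data, it is also written out like any other column when the frame is saved to a relational database.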