dtype={"ColA": str}
----------------------------------------------
use_koalas: True
df:
      ColA      ColB    ColC
0        A         0    0.00
1        None      1   12.30
2        C         2   22.20
3        D         1    3.14
type(df['ColA'][1]): <class 'NoneType'>
df[df.notna()]:
      ColA      ColB    ColC
0        A         0    0.00
1        None      1   12.30
2        C         2   22.20
3        D         1    3.14
type(df['ColA'][1]): <class 'NoneType'>
df = df[df.notna()].astype(dtype)
df:
      ColA      ColB    ColC
0        A         0    0.00
1        None      1   12.30
2        C         2   22.20
3        D         1    3.14
type(df['ColA'][1]): <class 'NoneType'>
----------------------------------------------
use_koalas: False
df:
      ColA      ColB    ColC
0        A         0    0.00
1     None         1   12.30
2        C         2   22.20
3        D         1    3.14
type(df['ColA'][1]): <class 'NoneType'>
df[df.notna()]:
      ColA      ColB    ColC
0        A         0    0.00
1      NaN         1   12.30
2        C         2   22.20
3        D         1    3.14
type(df[df.notna()]['ColA'][1]): <class 'float'>
df = df[df.notna()].astype(dtype)
df:
      ColA      ColB    ColC
0        A         0    0.00
1      nan         1   12.30
2        C         2   22.20
3        D         1    3.14
type(df['ColA'][1]): <class 'str'>
----------------------------------------------

I've experimented with using "string" for my dtype instead of str, but that has some downstream effects. This is a very large dataset, so ideally I would not use the mask function. Why do the pandas and Koalas DataFrames and functions behave differently here?

  • This seems to be more of an issue with `DataFrame.mask` using a different default value for string columns between the two libraries. [DataFrame.mask](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mask.html) explicitly says it uses `NaN`, which is what you see, and with the recent deprecation of `try_cast`, pandas clearly wants you to manually recast the Series to the correct type afterwards. – ALollz Oct 25 '21 at 18:13
  • @ALollz I'm using the indexing operator, not DataFrame.mask though. Are these the same operation under the hood? – Steven Adler Oct 25 '21 at 18:26
  • Yes, when you provide a DataFrame of boolean values to `[]`, it uses `where` under the hood: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#the-where-method-and-masking (see the example starting with `In[185]:`). Mask and where are basically complements, so I believe their implementation is identical. – ALollz Oct 25 '21 at 18:29
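
To make the point in the comments concrete, here is a minimal pandas-only sketch (it does not touch Koalas). The frame is rebuilt by hand from the printed output above, so its exact construction is an assumption, and the variable names are illustrative. It checks that indexing with a boolean DataFrame gives the same result as calling `where` explicitly, and then shows one possible way to cast only the non-missing values of `ColA` to `str` so that the missing entry is not stringified. The exact repr of the missing value (`None` vs `NaN`) and the behavior of `astype(str)` on it can depend on the pandas version, so the sketch only relies on `pd.isna`.

```python
import pandas as pd

# Frame reconstructed from the printed output in the question (assumed construction).
df = pd.DataFrame({
    "ColA": ["A", None, "C", "D"],
    "ColB": [0, 1, 2, 1],
    "ColC": [0.0, 12.3, 22.2, 3.14],
})

# Indexing with a boolean DataFrame versus calling where() explicitly.
masked = df[df.notna()]
via_where = df.where(df.notna())
same = masked.equals(via_where)

# The masked-out cell is still missing, whatever its concrete Python type is.
missing_after = masked["ColA"][1]

# One possible workaround: cast only the non-missing entries to str.
# where() keeps the original value where the condition is True (here: where
# the value is missing) and takes the stringified value everywhere else.
col = df["ColA"]
safe = col.where(col.isna(), col.astype(str))

print(same)
print(pd.isna(missing_after))
print(safe.tolist())
```

Whether an equivalent expression behaves the same way on a Koalas DataFrame is not verified here; the sketch only illustrates the pandas side of the comparison.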
