
I created the following dataframe:

import pandas as pd
import databricks.koalas as ks
df = ks.DataFrame(
    {'Date1': pd.date_range('20211101', '20211110', freq='1D'), 
     'Date2': pd.date_range('20201101', '20201110', freq='1D')})
df

Out[0]:

       Date1      Date2
0 2021-11-01 2020-11-01
1 2021-11-02 2020-11-02
2 2021-11-03 2020-11-03
3 2021-11-04 2020-11-04
4 2021-11-05 2020-11-05
5 2021-11-06 2020-11-06
6 2021-11-07 2020-11-07
7 2021-11-08 2020-11-08
8 2021-11-09 2020-11-09
9 2021-11-10 2020-11-10

When trying to get the minimum of Date1, I get the correct result:

df.Date1.min()

Out[1]:

Timestamp('2021-11-01 00:00:00')

Also, when trying to get the minimum value of each row, the correct result is returned:

df.min(axis=1)

Out[2]:

0   2020-11-01
1   2020-11-02
2   2020-11-03
3   2020-11-04
4   2020-11-05
5   2020-11-06
6   2020-11-07
7   2020-11-08
8   2020-11-09
9   2020-11-10
dtype: datetime64[ns]

However, using the same function across columns fails, returning an empty Series:

df.min(axis=0)

Out[3]:

Series([], dtype: float64)

Does anyone know why this is and if there's an elegant way around it?


2 Answers


Try this:

df.apply(min, axis=0)

Out[1]:

Date1   2021-11-01
Date2   2020-11-01
dtype: datetime64[ns]
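For comparison, the same column-wise reduction works directly in plain pandas. Here is a minimal sketch on an equivalent pandas DataFrame (no Koalas involved), which suggests the empty result is specific to Koalas rather than to pandas semantics:

```python
import pandas as pd

# Equivalent plain-pandas DataFrame with the same two datetime columns.
pdf = pd.DataFrame(
    {'Date1': pd.date_range('20211101', '20211110', freq='1D'),
     'Date2': pd.date_range('20201101', '20201110', freq='1D')})

# Column-wise minima work directly in pandas, no apply() workaround needed.
print(pdf.min(axis=0))
# Date1   2021-11-01
# Date2   2020-11-01
# dtype: datetime64[ns]
```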

This was indeed a bug in the Koalas code, but Koalas has since been merged into PySpark, and the pandas-on-Spark API was born. More information here.

With Spark 3.2.0 and above, one needs to replace

import databricks.koalas as ks

with

import pyspark.pandas as ps

and replace ks.DataFrame with ps.DataFrame. This completely eliminates the issue.
