I'm using Azure Databricks to convert a pandas DataFrame into a Koalas DataFrame:

kdf = ks.DataFrame(pdf)

This results in an error message of "an integer is required (got type str)"

I tried adding a dtype of str to force the Koalas DataFrame to be of type string:

 df = ks.DataFrame(pdf, dtype='str')

Adding the dtype works perfectly in VS Code using the Databricks extension, but results in an AssertionError when executed in the Azure Databricks workspace.

It seems that Azure Databricks must be using a different version of Koalas than the VS Code Databricks extension.

How can I get this to work in azure databricks?

How can I find out which version of Koalas Azure Databricks is using, and which version the Databricks VS Code extension is using?

I can't just use pip list to find the VS Code version of Koalas, because it comes with an extension rather than being an installed package.

Any help on this would be gratefully received

Laura

Laura Baker

1 Answer

You can find out the version of any imported library by printing module.__version__.

E.g. print(ks.__version__) in a Databricks notebook will print something like 1.0.1.
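If a module does not expose `__version__`, or you want to check several packages at once, a minimal sketch using only the standard library is shown here. It assumes Python 3.8+ and that the packages were installed under the distribution names `koalas`, `pandas` and `pyspark`. The helper name `installed_version` is made up for illustration. Run it once in the Databricks notebook and once in the Python environment that VS Code uses, then compare the output.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to what your environment uses.
for name in ("koalas", "pandas", "pyspark"):
    print(name, installed_version(name))
```

A result of `None` means the package is not installed in the interpreter that ran the code, which is itself a useful hint about which environment is actually being used.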

As for the real question: for a long time, pandas did not have a specific dtype for strings; they were just stored as objects. A dedicated string dtype was added fairly recently (in pandas 1.0, I believe). The problem is twofold:

  1. Koalas does not yet seem to understand this string dtype. You cannot pass in mixed dtypes in the constructor as you mention. So you should cast string columns back to object.
  2. If you do not specify the dtypes, Koalas will try to be smart and infer the dtype for object columns. This will fail sometimes, e.g. if the column is all null.
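To see which columns fall into these two cases before handing the frame to Koalas, something like the following pandas-only sketch may help. The helper name `columns_needing_attention` and the sample data are hypothetical, and it assumes pandas 1.0 or later and unique column names.

```python
import pandas as pd


def columns_needing_attention(pdf):
    """Return (string_dtype_cols, all_null_object_cols) for a pandas DataFrame."""
    # Case 1: columns that use the dedicated pandas string dtype.
    string_cols = [
        c for c in pdf.columns if isinstance(pdf[c].dtype, pd.StringDtype)
    ]
    # Case 2: object columns with only nulls, where there is nothing to infer from.
    all_null_object_cols = [
        c for c in pdf.columns
        if pdf[c].dtype == object and pdf[c].isna().all()
    ]
    return string_cols, all_null_object_cols


# Made-up sample data, for illustration only.
sample = pd.DataFrame({
    "name": pd.array(["a", None], dtype="string"),
    "note": pd.Series([None, None], dtype="object"),
    "n": [1, 2],
})
print(columns_needing_attention(sample))
```

This only inspects the pandas side. Whether a given Koalas version accepts or rejects those columns still has to be checked in the environment itself.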

So we have this infuriating situation where we know the column dtype but have no way to specify to Koalas what dtype it should use. The workaround is to fill in null values with an empty string, so that inference of dtypes will work:

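import databricks.koalas as ks

# df is the pandas DataFrame. Replace nulls with "" and cast each
# string-dtype column back to object so that Koalas can infer its type.
for stringcol in df.select_dtypes("string").columns:
    df[stringcol] = df[stringcol].fillna("").astype("object")

kdf = ks.DataFrame(df)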
rdeboo