
I am loading a sample CSV file using koalas and seeing some odd behavior.

The file has a column, area_code, that is blank in every row.

When I read the file with df = ks.read_csv('zipcodes.csv'), the column is reported as containing nulls, which is what I expect.

When I read the file with df = ks.read_csv('zipcodes.csv', dtype=str), the column is reported as having no nulls at all.

On closer inspection, dtype=str seems to cause this column to be loaded with the string value 'None' instead of a null.

Why does this happen? Any help is appreciated. Thanks in advance.

Bhupesh C

KrazzyNefarious
  • I may be misunderstanding the question, but when you specify the dtype while loading a CSV in pandas, it casts the data in all columns to that dtype. In your case that's str, and str(None) == "None", same as with booleans. – Julia Sep 05 '22 at 14:36

1 Answer


For pandas, that issue was discussed here and seems to be solved.

I don't know much about koalas, but you can try this:

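import numpy as np
import databricks.koalas as ks

# Read every column as a string, then turn empty strings back into nulls.
# keep_default_na is a pandas read_csv parameter; untested with koalas.
df = ks.read_csv('zipcodes.csv', dtype=str, keep_default_na=False).replace('', np.nan)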
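
To illustrate the mechanism described in the comment above (str(None) == "None"), here is a minimal plain-Python sketch. It does not use koalas; the in-memory sample data, `stringify`, and `restore_nulls` are hypothetical names made up for this example, and whether koalas does exactly this internally is an assumption.

```python
import csv
import io

# Hypothetical stand-in for zipcodes.csv: area_code is blank in every row.
SAMPLE = "zipcode,area_code\n10001,\n94105,\n"


def stringify(value):
    """Mimic a naive cast to str: a missing value becomes the text 'None'."""
    return str(value)


def restore_nulls(value, null_tokens=("", "None")):
    """Map placeholder strings back to a real null (None)."""
    return None if value in null_tokens else value


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Treat blank fields as missing, as a reader that infers nulls would.
parsed = [row["area_code"] or None for row in rows]

# Casting to str hides the nulls behind the literal text 'None'.
as_text = [stringify(v) for v in parsed]
null_count_before = sum(v is None for v in as_text)

# Mapping the placeholder text back recovers the nulls.
cleaned = [restore_nulls(v) for v in as_text]
null_count_after = sum(v is None for v in cleaned)

print(as_text, null_count_before)
print(cleaned, null_count_after)
```

If the loaded column really contains the text 'None', the same idea suggests replacing 'None' (rather than only '') with a null after reading. Note that this would also wipe out any genuine 'None' strings in the data.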
Timeless