Koalas Dataframe read_csv reads null column as not null

Question

I am loading a sample CSV file using Koalas and I am seeing some odd behavior.

The file has a column, area_code, that is entirely blank: every row has an empty value for it.

When I read the file with df = ks.read_csv('zipcodes.csv'), the output shows that the column contains nulls, as expected.

When I read the file with df = ks.read_csv('zipcodes.csv', dtype=str), the output shows that the column doesn't contain any nulls.

On closer inspection, it seems that dtype=str causes this column to be loaded with the string value "None" instead of a null.

Is there any reason why this would happen? Any help is appreciated. Thanks in advance.

Bhupesh C

I may be misunderstanding the question, but when you specify a dtype while loading a CSV in pandas, it casts the data in every column to that dtype. In your case that's str, and str(None) == "None", same as with booleans. - Julia, Sep 05 '22 at 14:36
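The str(None) == "None" point can be checked in plain Python, without Koalas or pandas. This is only a minimal sketch of the casting effect the comment describes, not of what ks.read_csv does internally; the area_code list is a made-up stand-in for the blank column.

```python
# Hypothetical stand-in for the blank column: every row is missing (None).
area_code = [None, None, None]

# While the values are still None, a null check finds all of them.
nulls_before = sum(v is None for v in area_code)

# Casting each value to str turns None into the four-character text "None".
as_text = [str(v) for v in area_code]

# The same null check now finds nothing, because "None" is an ordinary string.
nulls_after = sum(v is None for v in as_text)

print(nulls_before)  # 3
print(as_text)       # ['None', 'None', 'None']
print(nulls_after)   # 0
```

If the column really does hold the text "None", a null count will report zero even though every row looks empty.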

score 1 · Answer 1 · answered Sep 06 '22 at 16:58


For pandas, that issue was discussed here and seems to be solved.

I don't know much about koalas but you can try this :

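import numpy as np
import databricks.koalas as ks

# Read every column as str and keep blanks as '' (keep_default_na is a pandas option; untested in Koalas), then turn '' into NaN.
df = ks.read_csv('zipcodes.csv', dtype=str, keep_default_na=False).replace('', np.nan)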
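The idea behind that one-liner (read everything as text, keep blanks as empty strings, then convert the empty strings to real nulls) can be sketched with only the standard library. This is an illustration under stated assumptions, not the Koalas implementation: the sample data, the zipcode and city columns, and the read_rows_as_str helper are all hypothetical.

```python
import csv
import io

# Hypothetical stand-in for zipcodes.csv: area_code is blank on every row.
SAMPLE = "zipcode,area_code,city\n00501,,Holtsville\n10001,,New York\n"


def read_rows_as_str(text):
    """Read CSV text with every field as str, mapping blank fields to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: (value if value != "" else None) for name, value in row.items()}
        for row in reader
    ]


rows = read_rows_as_str(SAMPLE)
null_count = sum(row["area_code"] is None for row in rows)

print(rows[0]["zipcode"])  # 00501 (leading zeros survive because it stays a str)
print(null_count)          # 2
```

Doing the blank-to-null step explicitly after reading keeps the string typing for the populated columns while still letting a null check see the empty one.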

Timeless
