import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])

Output:

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

The values are stored as float64.

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print(df['one'])

Output:

a    1
b    2
c    3
Name: one, dtype: int64

But now the values are stored as int64.

The difference is that in the first example there is a NaN among the values.

What is the rule that determines the data types in the above examples?

Thanks!

gaganso
searain

2 Answers


The type of NaN is float, so pandas will upcast all the int values to floats too.

This can be easily checked:

>>> import numpy as np
>>> type(np.nan)
<class 'float'>

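You can see the upcast happen in isolation: reindexing an all-int Series so that a missing label introduces NaN forces the whole Series to float64 (a small sketch):

```python
import pandas as pd

# An all-int Series keeps the int64 dtype.
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s.dtype)  # int64

# Reindexing with a label that has no value introduces NaN,
# and since NaN is a float, the whole Series is upcast to float64.
s2 = s.reindex(['a', 'b', 'c', 'd'])
print(s2.dtype)  # float64
```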
I would recommend this interesting read.

rafaelc

Pandas inherits many bad decisions from NumPy.

Refer to:

Pandas Gotchas - Integer NA

Numpy or Pandas, keeping array type as integer while having a nan value

If you look at type(df.iloc[3, 0]), you can see the NaN is of type numpy.float64, which forces type coercion of the entire column to floats. Basically, Pandas is garbage for dealing with nullable integers, and you just have to deal with them as floating point numbers. You can also use the object dtype to hold integers, if performance isn't a concern.
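Note that newer pandas versions (0.24+) also provide the nullable "Int64" extension dtype (capital I), which can hold integers alongside missing values without upcasting to float; a minimal sketch:

```python
import pandas as pd

# The nullable "Int64" extension dtype stores missing values
# without forcing the integer values to become floats.
s = pd.Series([1, 2, 3, None], dtype='Int64')
print(s.dtype)  # Int64

# The integer values stay integers; the missing slot is pandas' NA.
print(s.isna().sum())  # 1
```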

Joel Bondurant