I am trying to get my column formatted as INT, because the values 1.0, 2.0, 3.0 cause problems with how I use the data. The first thing I tried was df['Severity'] = pd.to_numeric(df['Severity'], errors='coerce'). This looked like it worked at first, but the values appeared as floats again when I wrote to CSV. Next I tried df['Severity'] = df['Severity'].astype(int), followed by another failed attempt with df['Severity'] = df['Severity'].astype(int, errors='coerce'), which seemed like a logical solution to me.

I did some digging into pandas' docs and found this regarding how pandas handles NAs:

Typeclass   Promotion dtype for storing NAs
floating    no change
object      no change
integer     cast to float64
boolean     cast to object
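
The integer row of that table can be reproduced with a minimal sketch. The Series names below are hypothetical and the data is made up for illustration; it only shows the promotion, not a fix for it.

```python
import numpy as np
import pandas as pd

# A column of whole numbers with no gaps keeps an integer dtype.
complete = pd.Series([1, 2, 3])
print(complete.dtype)

# One missing value is enough to promote the whole column to float64,
# which is why the values display as 1.0, 2.0, ...
with_gap = pd.Series([1, 2, np.nan])
print(with_gap.dtype)
```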

What I find strange, though, is that when I run df.info(), it reports: Severity 452646 non-null object

Sample Data:

Age,Severity
1,1
2,2
3,3
4,NaN
5,4
6,4
7,5
8,7
9,6
10,5

Any help would be greatly appreciated :)

anshanno
  • Unfortunately I have found this to be a limitation of pandas. – IanS Sep 08 '16 at 12:15
  • Well, that is a rather large shame and makes me sad :( – anshanno Sep 08 '16 at 12:25
  • You have to either drop them or replace them with an integer value. `NaN` cannot be represented by an integer; this has nothing to do with pandas and more to do with numpy and, in general terms, the limitations of number representation in computer languages. It's the same problem for C++ as it is for Python, and probably for lots of other languages. You get `object` as the `dtype` because the type is mixed here. See: https://en.wikipedia.org/wiki/NaN – EdChum Sep 08 '16 at 12:25
  • @EdChum, Thanks! Setting NaN to 0 works fine with my application. – anshanno Sep 08 '16 at 12:33
  • Specifically in that wiki article: https://en.wikipedia.org/wiki/NaN#Integer_NaN – EdChum Sep 08 '16 at 12:44

2 Answers

How you handle missing values is up to you; there is no single correct way. You can either drop them using dropna or replace/fill them using replace/fillna. Note that there is no way to represent NaN using integers: https://en.wikipedia.org/wiki/NaN#Integer_NaN.
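
As a rough sketch of both options, assuming a shortened copy of the sample data from the question and a fill value of 0 (the value the asker settled on in the comments), the cast to int succeeds once no NaN remains. The names `filled` and `dropped` are illustrative only.

```python
import io
import pandas as pd

# Shortened copy of the sample data from the question.
csv_text = "Age,Severity\n1,1\n2,2\n3,3\n4,NaN\n5,4\n"
df = pd.read_csv(io.StringIO(csv_text))

# Option 1: fill missing values with a sentinel (0 here), then cast.
filled = df.copy()
filled["Severity"] = filled["Severity"].fillna(0).astype(int)

# Option 2: drop the rows where Severity is missing, then cast.
dropped = df.dropna(subset=["Severity"]).copy()
dropped["Severity"] = dropped["Severity"].astype(int)

# With an integer dtype, to_csv writes 1, 2, 3 rather than 1.0, 2.0, 3.0.
print(filled.to_csv(index=False))
```

Which option fits depends on whether 0 (or any other sentinel) could be mistaken for a real Severity value in your application.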

The reason you get object as the dtype is that you now have a mixture of integers and floats. Depending on the operation, the entire Series may be upcast to float, but in your case you have mixed dtypes.
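
One way to check what an object column actually holds is to count the Python types stored in it. The sketch below uses a made-up Series (the string "4" is an assumption for illustration and is not taken from the question's data) and then coerces it with pd.to_numeric, the call the question already tried.

```python
import pandas as pd

# Hypothetical object column mixing Python ints, a float NaN, and a string.
s = pd.Series([1, 2, float("nan"), "4"], dtype=object)

# Count the Python types stored in the column.
type_counts = s.map(type).value_counts()
print(type_counts)

# Coerce everything parseable to a number; unparseable entries become NaN.
# The result is float64 because NaN cannot be stored in an integer dtype.
numeric = pd.to_numeric(s, errors="coerce")
print(numeric.dtype)
```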

EdChum

As of pandas 0.24 (January 2019), it is possible to do what you want by using the nullable integer data type, which uses an arrays.IntegerArray to represent the data:

In [83]: df.Severity
Out[83]:
0    1.0
1    2.0
2    3.0
3    NaN
4    4.0
5    4.0
6    5.0
7    7.0
8    6.0
9    5.0
Name: Severity, dtype: float64

In [84]: df.Severity.astype('Int64')
Out[84]:
0      1
1      2
2      3
3    NaN
4      4
5      4
6      5
7      7
8      6
9      5
Name: Severity, dtype: Int64
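
Since the original problem showed up when writing to CSV, here is a small sketch of that step with the nullable dtype. It assumes pandas 0.24 or later and uses a made-up three-row frame; the missing value should be written as an empty field by default.

```python
import io
import numpy as np
import pandas as pd

# Made-up frame with one missing Severity value.
df = pd.DataFrame({"Age": [1, 2, 3], "Severity": [1.0, np.nan, 3.0]})

# Capital "I": the nullable integer dtype, not NumPy's int64.
df["Severity"] = df["Severity"].astype("Int64")

# Write to an in-memory buffer instead of a file to inspect the result.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Reading the file back with a plain read_csv will infer float64 again unless the dtype is specified for that column.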
fuglede