I am stuck with a weird type conversion issue. I fill a pandas column of type float with integer values. Naturally, they are stored as floating-point numbers, but are still "accurate" to int precision. Converting them to int works like a charm, but converting to the nullable Int dtype directly blows up...
Say a pandas DataFrame pp has a column Value. All values written into it are int, but get saved as type float.
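For reference, a minimal reconstruction of the setup (the values are copied from the outputs below, including the suspicious fourth one, so treat this as an assumed repro rather than my original code):

import pandas as pd

# All values are conceptually ints, stored as float64;
# index 3 carries a tiny floating-point residue that only shows up later.
pp = pd.DataFrame({'Value': [3500000.0, 600000.0, 400000.0,
                             8300000.000000001, 5700000.0, 4400000.0]})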
print(f"pp['Value']:\n{pp['Value']}")
pp['Value']:
0    3500000.0
1     600000.0
2     400000.0
3    8300000.0
4    5700000.0
5    4400000.0
Name: Value, dtype: float64
Clearly pp['Value'] is of dtype float64, which it should be, because it might contain NaN (even though here, all values are integers).
Now I want this Series to be of the nullable type Int64. Should work, right? But it does not: pp['Value'].astype('Int64') raises
TypeError: cannot safely cast non-equivalent float64 to int64
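For what it's worth, the failure seems reproducible with just the one suspicious value (a sketch; I am assuming the nullable-Int cast behaves the same on a one-element Series):

import pandas as pd

# A single float with a tiny residue triggers the same error:
pd.Series([8300000.000000001]).astype('Int64')
# TypeError: cannot safely cast non-equivalent float64 to int64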
Huh? A value that went in as an int has turned non-convertible back from float? Ok, that can happen.. So let's see whether we can safely convert to int? [NOTE: This only works for the example here. It is not a solution if the series contains NaN.]
Approach A: Convert the series to int64 - works like a charm (the numbers really are all integer-castable):
pp['Value'] = pp['Value'].astype('int64')
print(f"pp['Value']:\n{pp['Value']}")
pp['Value']:
0    3500000
1     600000
2     400000
3    8300000
4    5700000
5    4400000
Name: Value, dtype: int64
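My best guess for why this succeeds: the plain int64 cast goes through NumPy, which truncates toward zero without any safety check (an assumption from observed behavior, not from the pandas source):

import numpy as np

# NumPy silently drops the fractional residue instead of complaining:
np.array([8300000.000000001]).astype('int64')
# array([8300000])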
Ok, so... what?? Converting to int64 works, while Int64 fails? Wait a second..
Approach B: Let's look closely, convert each element individually (on the original float column), and check whether any values carry weird floating-point artifacts. And indeed, we see 1 case blow up: the 4th value gets shown with the floating-point nuisance of .000000001. But pandas knows how to cast this to int anyway; all values get nicely converted, as I would hope:
for idx, row in pp.iterrows():
    # value % 1 is the fractional part, i.e. the residual vs. the integer below it
    print(f"{idx}: value = {row['Value']}, residual vs. int: {row['Value'] % 1}, int value: {int(row['Value'])}")
0: value = 3500000.0, residual vs. int: 0.0, int value: 3500000
1: value = 600000.0, residual vs. int: 0.0, int value: 600000
2: value = 400000.0, residual vs. int: 0.0, int value: 400000
3: value = 8300000.000000001, residual vs. int: 9.313225746154785e-10, int value: 8300000
4: value = 5700000.0, residual vs. int: 0.0, int value: 5700000
5: value = 4400000.0, residual vs. int: 0.0, int value: 4400000
So: Wait a minute... what is going on here? I can feed int values into a float column, suffering floating-point artifacts. Ok, got that. But then, while I can safely cast back to int (all values individually, or the entire Series), I cannot cast to Int64??
--> Why does pandas/python natively know how to cast to int64, while the conversion to Int64 trips over floating-point artifacts?
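For what it's worth, the error message reads as if the nullable Int64 path does a strict round-trip check instead of truncating; a minimal sketch of what such a check could look like (my assumption, not the actual pandas internals):

import numpy as np

values = np.array([3500000.0, 600000.0, 400000.0,
                   8300000.000000001, 5700000.0, 4400000.0])
casted = values.astype('int64')  # truncates, as in Approach A
# A strict losslessness check fails at index 3, where the residue was dropped:
if not np.array_equal(values, casted):
    raise TypeError("cannot safely cast non-equivalent float64 to int64")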
Edit note:
pp['Value'] = pp['Value'].round().astype('Int64')
is indeed a workaround.. which should be completely unnecessary, as pp['Value'].astype('int') works (except for the NaN records, of course...)
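For completeness, the round() workaround also handles the NaN case that a plain int cast cannot (a small usage sketch with made-up data):

import numpy as np
import pandas as pd

s = pd.Series([8300000.000000001, np.nan])
print(s.round().astype('Int64'))
# 0    8300000
# 1       <NA>
# dtype: Int64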