I am stuck with a weird type conversion issue. I fill a pandas column of dtype 'float' with integer values. Naturally, they are stored as floating-point numbers, but still "accurate" to integer precision. Converting them to int works like a charm, but converting to Int64 directly blows up...

Say a pandas DataFrame pp has a column Value. All values written into it are 'int', then stored as type float.
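
(To make this reproducible, here is a hypothetical reconstruction of pp based on the outputs shown below; the values are assumptions, and the tiny residue on the fourth one only becomes visible further down.)

import pandas as pd

# Hypothetical reconstruction of the question's data; the fourth value
# carries the floating-point residue that surfaces later in the post.
pp = pd.DataFrame({"Value": [3500000.0, 600000.0, 400000.0,
                             8300000.000000001, 5700000.0, 4400000.0]})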

print(f"pp['Value']:\n{pp['Value']}")
pp['Value']:
0      3500000.0
1       600000.0
2       400000.0
3      8300000.0
4      5700000.0
5      4400000.0
Name: Value, dtype: float64

Clearly pp['Value'] is of dtype float, which it should be, because it might contain NaN (even though here, all values are integers).

Now I want this Series to be of type Int64. Should work, right? But it does not: pp['Value'].astype('Int64') raises TypeError: cannot safely cast non-equivalent float64 to int64
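
(For completeness, the failure can be captured like this; the exact exception type and message may vary across pandas versions, but this matches the one quoted above.)

try:
    pp['Value'].astype('Int64')
except TypeError as err:
    print(err)  # cannot safely cast non-equivalent float64 to int64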

Huh? A float that holds an integer value has turned non-convertible to int? Ok, that can happen... So let's see whether we can safely convert to int. [NOTE: Only works for the example here. Not a solution if the series contains NaN]

Approach A: Convert the series to int64 - works like a charm (the numbers really are all integer-castable):

pp['Value'] = pp['Value'].astype('int64')
print(f"pp['Value']:\n{pp['Value']}")
pp['Value']:
0      3500000
1       600000
2       400000
3      8300000
4      5700000
5      4400000
Name: Value, dtype: int64
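
(The reason this succeeds: a plain astype('int64') goes through numpy's default "unsafe" casting, which truncates toward zero and performs no equivalence check. A quick sketch:)

import numpy as np

# numpy's default cast truncates silently, with no equivalence check:
print(np.array([8300000.000000001]).astype('int64'))  # [8300000]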

Ok, so... what? Converting to int works, while Int fails? Wait a second...

Approach B: Let's look closely, convert each element individually, and check whether any values have weird floating-point artifacts. And indeed, one case blows up: the 4th value carries the floating-point nuisance of an extra ~0.000000001 and shows a non-zero residual. But pandas knows how to cast it to int anyway; all values get nicely converted, as I would hope:

for idx, row in pp.iterrows():
    print(f"{idx}: value = {row['Value']}, residual vs. int: {row['Value'] % 1}, int value: {int(row['Value'])}")
0: value = 3500000.0, residual vs. int: 0.0, int value: 3500000
1: value = 600000.0, residual vs. int: 0.0, int value: 600000
2: value = 400000.0, residual vs. int: 0.0, int value: 400000
3: value = 8300000.000000001, residual vs. int: 1.862645149230957e-09, int value: 8300000
4: value = 5700000.0, residual vs. int: 0.0, int value: 5700000
5: value = 4400000.0, residual vs. int: 0.0, int value: 4400000
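
(Instead of looping, a vectorized sketch of the same check: the fractional part flags any value that is not exactly integral, and NaN rows compare False, so they are left out.)

# Vectorized check: fractional part of each value.
residue = pp['Value'] % 1
print(pp[residue > 0])  # rows whose float value is not exactly integral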

So: wait a minute... what is going on here? I can feed int into a float column and suffer floating-point arithmetic issues. Ok, got that. But then, while I can safely cast back to int (all values individually, or the entire Series), I cannot cast to Int64??

--> Why does pandas/Python natively know how to cast to int64, while a conversion to Int64 trips over floating-point arithmetic issues?


Edit note:

pp['Value'] = pp['Value'].round().astype('Int64')

is indeed a workaround... which should be completely unnecessary, as pp['Value'].astype('int') works (except for the NaN records, of course...)
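
(A hypothetical example of why the round() detour is still NaN-safe: round() leaves NaN in place, and the nullable 'Int64' dtype then stores it as <NA>.)

import numpy as np
import pandas as pd

s = pd.Series([8300000.000000001, np.nan])
print(s.round().astype('Int64'))
# 0    8300000
# 1       <NA>
# dtype: Int64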

KingOtto
  • Perhaps the issue is caused by the 4th value? `8300000.000000001` - you could check by running your code without that value. – Jason Mar 06 '22 at 19:38

1 Answer

As Jason suggested in his comment, your edit solves the problem because rounding changes 8300000.000000001 to 8300000.0.

This is important: it means that after rounding, the value before and the value after the type conversion are equal, so they meet the "safe" casting rule for numpy conversions. When converting to 'Int64', pandas uses the numpy.ndarray.astype function with the casting rule set to "safe" (falling back to an element-wise equality check when numpy rejects the cast outright). The details on "safe" casting are documented under numpy.can_cast in the NumPy docs.
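
(In outline, the check looks something like the sketch below. This is a simplified reconstruction, not pandas' exact source; the real code lives in pandas' nullable-integer internals and varies by version.)

import numpy as np

def safe_cast_sketch(values, dtype):
    # float64 -> int64 is never "safe" at the dtype level, so this raises...
    try:
        return values.astype(dtype, casting="safe")
    except TypeError:
        # ...and pandas falls back to a value-level equivalence check:
        casted = values.astype(dtype)
        if (casted == values).all():
            return casted
        raise TypeError(
            f"cannot safely cast non-equivalent {values.dtype} to {dtype}"
        )

safe_cast_sketch(np.array([8300000.0]), "int64")          # ok: values equivalent
safe_cast_sketch(np.array([8300000.000000001]), "int64")  # raises TypeError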

As far as I am aware, there is no way to request that pandas uses the numpy function with a different type of casting, so rounding the values first is the solution to your problem.

FluxZA
  • Not sure I follow your answer. The conversion *is safe*, in the sense that the direct call to `().astype('int')` is supported. Just the one to `Int64` causes issues. How can one be safe, and the other not? – KingOtto Apr 27 '23 at 14:19
  • @KingOtto it is only unsafe by the standards set by the numpy package. They define a safe cast as one where the value before is equal to the value after: `830.000001 != 830`. By default numpy does not perform this safety check; however, when pandas calls `numpy.ndarray.astype` internally, it sets the casting rule to 'safe'. – FluxZA May 02 '23 at 11:12
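
(The dtype-level rule the comment refers to is exposed directly by numpy.can_cast; a quick check:)

import numpy as np

print(np.can_cast(np.float64, np.int64, casting="safe"))  # False: never "safe"
print(np.can_cast(np.int32, np.int64, casting="safe"))    # True: widening is safe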