
In order to save memory, I started looking into downcasting numeric column types in pandas.

In the quest to save memory, I would like to convert object columns to e.g. float32 or float16 instead of the default float64, or to int32, int16, or int8 instead of the default int64, etc.

However, this means that large numbers can no longer be represented correctly once values within the column/series exceed the limits of the target type. More details on this can be found in the data type docs. For instance, int16 covers the integer range -32768 to 32767.
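
These limits can also be queried directly from NumPy via np.iinfo and np.finfo rather than looked up in the docs (a quick sketch):

import numpy as np

# Query the representable ranges straight from NumPy
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)  # -32768 32767
print(np.finfo(np.float16).max)                        # 65504.0
print(np.finfo(np.float32).max)                        # 3.4028235e+38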

While playing around with extremely large numbers, I noticed that pd.to_numeric() offers no way to prevent such very high numbers from being coerced to the placeholder inf, which can also be produced manually via float("inf"). In the following example, I demonstrate that one specific value in the first column, namely 100**100, is only represented correctly in the float64 format, but not in float32. My concern in particular is that pd.to_numeric(downcast="float") does not tell the user that it converts such large numbers to inf behind the scenes, which results in a silent loss of information that is clearly undesirable, even if memory is saved this way.

In[45]:
import pandas as pd

# Construct an example dataframe
df = pd.DataFrame({"Numbers": [100**100, 6, 8], "Strings": ["8.0", "6", "7"]})

# Print out user info
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Numbers  3 non-null      object
 1   Strings  3 non-null      object
dtypes: object(2)
memory usage: 176.0+ bytes
None

# Undesired result obtained by downcasting
pd.to_numeric(df["Numbers"], errors="raise", downcast="float")
Out[46]: 
0    inf
1    6.0
2    8.0
Name: Numbers, dtype: float32

# Correct result without downcasting
pd.to_numeric(df["Numbers"], errors="raise")
Out[47]: 
0    1.000000e+200
1     6.000000e+00
2     8.000000e+00
Name: Numbers, dtype: float64

I would strongly prefer that pd.to_numeric() did not automatically coerce values to inf, since this amounts to a loss of information. It seems its priority is simply to save memory, no matter what.

There should be a built-in way to avoid this coercion and the resulting information loss. Of course, I could test for it afterwards and fall back to the higher precision as a corrective measure, like so:

In[61]:
# Save to a temporary "dummy" series; otherwise the infinity values would overwrite the real values and the info would already be lost
dummy_series = pd.to_numeric(df["Numbers"], errors="raise", downcast="float")

## Check for the presence of undesired inf-values ##
# i) inf values produced: avoid downcasting
if float("inf") in dummy_series.values:
    print("\nInfinity values are present!\nTry again without downcasting.\n")
    df["Numbers"] = pd.to_numeric(df["Numbers"], errors="raise")

# ii) If there is no inf-value, adopt the downcasted series as is
else:
    df["Numbers"] = dummy_series

# Check result
print(df["Numbers"])

Out[62]:
Infinity values are present!
Try again without downcasting.

0    1.000000e+200
1     6.000000e+00
2     8.000000e+00
Name: Numbers, dtype: float64

This doesn't seem very pythonic to me though, and I bet there must be a better built-in solution, either in pandas or in numpy directly.
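
For completeness, here is a slightly more compact variant of the same workaround, checking for overflow with np.isinf; it is still just a sketch, not a built-in safeguard:

import numpy as np

downcast_series = pd.to_numeric(df["Numbers"], errors="raise", downcast="float")
# Keep the downcast result only if nothing overflowed to +/- inf
if np.isinf(downcast_series).any():
    df["Numbers"] = pd.to_numeric(df["Numbers"], errors="raise")
else:
    df["Numbers"] = downcast_series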

Andreas L.
  • It's hard to follow what exactly you want. I think you want to transform a `pd.Series` object (that is, a column). Can you provide valid Series instances of the input and the desired output? Something like: `input = pd.Series([...], dtype=...)`, `wanted = pd.Series([...], dtype=...)` where you fill out the dots. – Han-Kwang Nienhuys Jul 15 '20 at 17:56
  • `input = pd.Series([10**100, 2.44], dtype="object")` --> `wanted = pd.Series([10**100, 2.44], dtype=float64 OR float32 OR float16 ...)` depending on what's possible without losing information when large numbers are just converted to infinity (`inf`). See, I want to save memory, that's all I want to achieve. I assume there must be a method which automatically detects the least memory-consuming format that can still represent all numbers correctly (and does not produce unwanted results like "infinity", e.g. `float32` with `10**100` -> `inf`). – Andreas L. Jul 15 '20 at 19:49
  • Could you please update the question with the input/output and be unambiguous in the dtype? Use multiple input/wanted pairs if you need. Make sure that the `wanted` Series are valid data (no errors if you run them). – Han-Kwang Nienhuys Jul 15 '20 at 20:04
  • No problem, I hope now it's gotten clearer what I aim for. Let me know if you need more specifics. – Andreas L. Jul 15 '20 at 20:22
  • I don't see unambiguous input/wanted pairs in the updated question. – Han-Kwang Nienhuys Jul 15 '20 at 20:24
  • Self-quote: "In the quest to save memory, I would like to convert object columns to e.g. float32 or float16 instead of the default float64, or to int32, int16, or int8 instead of the default int64, etc." (and then my example below tries to convert an object column to the smallest float format possible via the downcast option of pd.to_numeric(), which leads to a resulting float32 but at the cost of losing information: the value 100**100 was converted silently to infinity) – Andreas L. Jul 15 '20 at 20:27
  • What do you think of my answer? – Han-Kwang Nienhuys Jul 16 '20 at 16:50
  • Thx for your effort, see comment below your answer. – Andreas L. Jul 27 '20 at 13:59

1 Answer


For float16, float32, and float64, the maximum values are known. So, you can just look at the maximum value and decide the datatype based on that:


import numpy as np
import pandas as pd

cases = [[1e100, 6, 8],
         [10**100, 6, 8],
         [1e36, 6, 8],
         [-32760, 6, 8],
         [10**500, 6, 8],
         ]

maxfloats = [(65504, np.float16), (3.402e38, np.float32), (1.797e308, np.float64)]


for input_list in cases:
    
    input_s = pd.Series(np.array(input_list, dtype=object))
    maxval = np.abs(input_s).max()
    for dtype_max, dtype in maxfloats:
        if maxval < dtype_max:
            break
    else:
        dtype = object
    
    out_array = np.array(input_s, dtype=dtype)
    out_s = pd.Series(out_array)
    print(f'Input:\n{input_s}\nOutput:\n{out_s}\n----')

Result:

Input:
0    1e+100
1         6
2         8
dtype: object
Output:
0    1.000000e+100
1     6.000000e+00
2     8.000000e+00
dtype: float64
----
Input:
0    1000000000000000000000000000000000000000000000...
1                                                    6
2                                                    8
dtype: object
Output:
0    1.000000e+100
1     6.000000e+00
2     8.000000e+00
dtype: float64
----
Input:
0    1e+36
1        6
2        8
dtype: object
Output:
0    1.000000e+36
1    6.000000e+00
2    8.000000e+00
dtype: float32
----
Input:
0    -32760
1         6
2         8
dtype: object
Output:
0   -32768.0
1        6.0
2        8.0
dtype: float16
----
Input:
0    1000000000000000000000000000000000000000000000...
1                                                    6
2                                                    8
dtype: object
Output:
0    1000000000000000000000000000000000000000000000...
1                                                    6
2                                                    8
dtype: object
Han-Kwang Nienhuys
  • It's a workaround like mine, just that you looked up the limit values for each data type manually. I would prefer an internal feature of `pd.to_numeric()`, another built-in function, or anything else more pythonic. Moreover, by using these workarounds I'd have to apply them on purpose every time, which adds extra computation time and coding. Also, I'm not sure whether these limit values for each data type are constants that will stay unchanged forever, or whether they could change at some point, making the workaround obsolete without anyone noticing (see the note below). – Andreas L. Jul 27 '20 at 14:03
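
A follow-up note on that last concern: the limits do not need to be hard-coded, since NumPy exposes them through np.finfo; the lookup table in the answer could be built at runtime instead, for example:

import numpy as np

# Derive the (max value, dtype) table from NumPy's machine parameters
maxfloats = [(np.finfo(dt).max, dt) for dt in (np.float16, np.float32, np.float64)]

One might still subtract a small safety margin from the exact maxima, much as the hard-coded values in the answer effectively do.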