In order to save memory, I started looking into downcasting numeric column types in pandas. In this quest, I would like to convert object columns to e.g. float32 or float16 instead of the automatic standard float64, or to int32, int16, or int8 instead of the automatic standard int64, etc.
However, this means that large numbers cannot be represented or stored correctly once certain values within the column/series exceed the limits of the chosen type. More details on this can be seen in the data type docs. For instance, int16 covers integers from -32768 to 32767.
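For reference, these limits can also be checked directly with numpy (just a quick illustration):
import numpy as np

print(np.iinfo(np.int16))    # integer limits: min = -32768, max = 32767
print(np.finfo(np.float32))  # float32 maximum is roughly 3.4e+38
print(np.finfo(np.float64))  # float64 maximum is roughly 1.8e+308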
While playing around with extremely large numbers, I noticed that pd.to_numeric() doesn't provide any means of preventing such very high numbers from being coerced to the placeholder inf, which can also be produced manually via float("inf").
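To show what that placeholder looks like in practice (a quick sketch, with values chosen purely for demonstration):
import numpy as np

np.float64(1e200)                  # 1e+200 -- still representable in float64
np.float32(1e200)                  # inf -- 1e200 exceeds float32's maximum (~3.4e38);
                                   # newer numpy versions may also emit an overflow warning here
float("inf") == np.float32(1e200)  # True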
In the following specific example, I'm going to demonstrate that one specific value in the first column, namely 100**100, is only represented correctly in the float64 format, but not in float32. My concern in particular is that upon using pd.to_numeric(downcast="float"), this function doesn't tell the user that it converts such high numbers to inf behind the scenes, which as a consequence leads to a silent loss of information that is clearly undesired, even if memory can be saved this way.
In[45]:
# Construct an example dataframe
df = pd.DataFrame({"Numbers": [100**100, 6, 8], "Strings": ["8.0", "6", "7"]})
# Print out user info
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Numbers  3 non-null      object
 1   Strings  3 non-null      object
dtypes: object(2)
memory usage: 176.0+ bytes
None
# Undesired result obtained by downcasting
pd.to_numeric(df["Numbers"], errors="raise", downcast="float")
Out[46]:
0    inf
1    6.0
2    8.0
Name: Numbers, dtype: float32
# Correct result without downcasting
pd.to_numeric(df["Numbers"], errors="raise")
Out[47]:
0    1.000000e+200
1     6.000000e+00
2     8.000000e+00
Name: Numbers, dtype: float64
I would strongly prefer that pd.to_numeric() avoided automatically coercing values to inf, since this signifies a loss of information. It seems like its priority is simply to save memory no matter what. There should be a built-in method to avoid this coercion and the information loss it produces. Of course, I can test for it afterwards and fall back to the highest precision as a corrective measure, like so:
In[61]:
# Save to a temporary "dummy" series; otherwise the infinity values would
# overwrite the real values and the information would already be lost
dummy_series = pd.to_numeric(df["Numbers"], errors="raise", downcast="float")
## Check for the presence of undesired inf-values ##
# i) If inf-values were produced: avoid downcasting
if float("inf") in dummy_series.values:
    print("\nInfinity values are present!\nTry again without downcasting.\n")
    df["Numbers"] = pd.to_numeric(df["Numbers"], errors="raise")
# ii) If there is no inf-value, adopt the downcasted series as is
else:
    df["Numbers"] = dummy_series
# Check result
print(df["Numbers"])
Out[62]:
Infinity values are present!
Try again without downcasting.
0    1.000000e+200
1     6.000000e+00
2     8.000000e+00
Name: Numbers, dtype: float64
This doesn't seem very pythonic to me though, and I bet there must be a better built-in solution, either in pandas or in numpy directly.
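For completeness, here is a slightly tidier sketch of the same idea (the helper name safe_downcast_float is my own, not a pandas API); it uses np.isinf to detect whether the downcast itself introduced any inf values, but it is still a manual check rather than the built-in safeguard I'm hoping for:
import numpy as np
import pandas as pd

def safe_downcast_float(series):
    # Convert with full float64 precision first
    converted = pd.to_numeric(series, errors="raise")
    # Attempt the downcast (float32 at minimum)
    downcast = pd.to_numeric(converted, downcast="float")
    # Keep the downcast result only if it did not introduce any new inf values
    if np.isinf(downcast).sum() > np.isinf(converted).sum():
        return converted
    return downcast

df["Numbers"] = safe_downcast_float(df["Numbers"])
print(df["Numbers"].dtype)  # float64 here, because 100**100 overflows float32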