7

Why does Pandas coerce my numpy float32 to float64 in this piece of code:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']], dtype=np.float32)
>>> A = df.ix[:, 0:1].values
>>> df.ix[:, 0:1] = A
>>> df[0].dtype
dtype('float64')

The behavior seems so odd to me that I wonder if it is a bug. I am on Pandas version 0.17.1 (the current PyPI version), and I note there have been coercion bugs addressed recently, see https://github.com/pydata/pandas/issues/11847 . I haven't tried this piece of code against an updated GitHub master.

Is it a bug or do I misunderstand some "feature" in Pandas? If it is a feature, then how do I get around it?

(The coercion problem relates to a question I recently asked about the performance of Pandas assignments: Assignment of Pandas DataFrame with float32 and float64 slow)

Finn Årup Nielsen
  • 6,130
  • 1
  • 33
  • 43
  • It may be odd but it is consistent with numpy. Numpy automatically turns even integers into numpy.float64 types. Since Pandas has numpy at the core, this functionality is expected IMO (although certainly not ideal in your case). – Benji Feb 05 '16 at 18:47
  • But `pandas` has a greater propensity to use `dtype=object` than plain `numpy`. It gives it greater flexibility when handling mixed types - strings can be any length, columns can mix types, etc. But the flexibility comes with computational and memory costs. – hpaulj Feb 05 '16 at 19:16

2 Answers

3

I think it is worth posting this as a GitHub issue. The behavior is certainly inconsistent.

The code takes a different branch based on whether the DataFrame is mixed-type or not (source).

  • In the mixed-type case the ndarray is converted to a Python list of float64 numbers and then back into a float64 ndarray, disregarding the DataFrame's dtype information (in the function maybe_convert_objects()).

  • In the non-mixed-type case the DataFrame content is updated pretty much directly (source) and the DataFrame keeps its float32 dtypes.

Martin Valgur
  • 5,793
  • 1
  • 33
  • 45
2

Not an answer, but my recreation of the problem:

In [2]: df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']], dtype=np.float32)
In [3]: df.dtypes
Out[3]: 
0    float32
1    float32
2     object
dtype: object
In [4]: A=df.ix[:,:1].values
In [5]: A
Out[5]: 
array([[ 1.,  2.],
       [ 3.,  4.]], dtype=float32)
In [6]: df.ix[:,:1] = A
In [7]: df.dtypes
Out[7]: 
0    float64
1    float64
2     object
dtype: object
In [8]: pd.__version__
Out[8]: '0.15.0'

I'm not as familiar with pandas as with numpy, but I'm puzzled as to why ix[:,:1] gives me a 2-column result. In numpy that sort of indexing gives just 1 column.

If I assign a single column, the dtype does not change:

In [47]: df.ix[:,[0]]=A[:,0]
In [48]: df.dtypes
Out[48]: 
0    float32
1    float32
2     object

The same actions without mixed datatypes do not change the dtypes:

In [100]: df1 = pd.DataFrame([[1, 2, 1.23], [3, 4, 3.32]], dtype=np.float32)
In [101]: A1=df1.ix[:,:1].values
In [102]: df1.ix[:,:1]=A1
In [103]: df1.dtypes
Out[103]: 
0    float32
1    float32
2    float32
dtype: object

The key must be that with mixed values the dataframe is, in one sense or another, a dtype=object array, whether that's true of its internal data storage or just of its numpy interface.

In [104]: df1.as_matrix()
Out[104]: 
array([[ 1.        ,  2.        ,  1.23000002],
       [ 3.        ,  4.        ,  3.31999993]], dtype=float32)
In [105]: df.as_matrix()
Out[105]: 
array([[1.0, 2.0, 'a'],
       [3.0, 4.0, 'b']], dtype=object)
hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Assignment with a single column and a for-loop over column names seems to give reasonable performance for "within-type" (non-casting) assignment and yields the correct type. However, that method is over twice as slow if there is casting to and from float32 and float64. I suppose multiple reallocations would explain the latter problem. – Finn Årup Nielsen Feb 09 '16 at 13:47