
So I have a pandas DataFrame with a money column stored at two decimal places of precision, like "133.04". There are no numbers with three or more decimal places, only two.

My Try: Decimal module

I've tried to use the Decimal module for this, but when I tried to resample it like this

gr_by_price = df['price'].resample(timeframe, how='ohlc')

I get

pandas.core.groupby.DataError: No numeric types to aggregate

Right before this I check dtype

print(type(df['price'][0]))
<class 'decimal.Decimal'>

I'm new to this library and to money processing. Maybe Decimal is not the right choice for this? What should I do?

If I cast this column to <class 'numpy.float64'> everything works.

Update: For now I'm using this method

d.Decimal("%0.2f" % float(d.Decimal("1.04")))
Decimal('1.04')

From this question
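An alternative worth noting: `Decimal.quantize` (a standard method of the `decimal` module) performs the same two-decimal clipping without the float/string round-trip; `ROUND_HALF_UP` here is just one choice of rounding mode:

```python
from decimal import Decimal, ROUND_HALF_UP

# The noisy value from the float round-trip shown in the comments below
value = Decimal("1.04000000000000003552713678800500929355621337890625")

# quantize fixes the exponent to two decimal places
clipped = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(clipped)  # 1.04
```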

    You'd need to use `np.float64` for this unfortunately, so long as the precision and limits are not exceeded you should be fine – EdChum Apr 09 '15 at 09:50
  • @EdChum hmm.. I will not end up with 133.04 becoming 133.05 or 133.03, will I? So I cast it to float64 right before resampling, resample and cast to Decimal again, right? – userqwerty1 Apr 09 '15 at 09:56
  • That may possibly happen but usually the imprecision occurs at the lower digits but if you cast to Decimal at the end it should clip this – EdChum Apr 09 '15 at 09:58
  • @EdChum Thank you. I will do this way for now. And here is fun part `>>> d.Decimal(float(d.Decimal("1.04"))) Decimal('1.04000000000000003552713678800500929355621337890625')` – userqwerty1 Apr 09 '15 at 10:05
  • Most of the time what you want to do is store the numbers as floats and then use appropriate formats to display. The fun is happening at the 12th or 13th decimal, so it's rarely an issue in practice. Decimal is not a core dtype (like int or float) so it can be a pain to work with. Note that outside of core dtypes pandas stores things as objects. Use the `info()` method to check dtypes. – JohnE Apr 09 '15 at 12:47
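The float64 round-trip the comments describe can be sketched like this (a made-up sample series; note this uses the newer `.resample(...).ohlc()` call chain rather than the `how='ohlc'` argument):

```python
from decimal import Decimal

import numpy as np
import pandas as pd

index = pd.date_range("2015-01-01", periods=4, freq="min")
prices = pd.Series(
    [Decimal("1.04"), Decimal("1.10"), Decimal("1.02"), Decimal("1.07")],
    index=index,
)

# 1) Cast to float64 so pandas' fast numeric aggregators can run
as_float = prices.astype(np.float64)

# 2) Resample into 2-minute OHLC bars
bars = as_float.resample("2min").ohlc()

# 3) Cast back to Decimal, clipping any float noise at two decimals
as_decimal = bars.apply(lambda col: col.map(lambda v: Decimal(f"{v:.2f}")))
print(as_decimal)
```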

4 Answers


We had a similar problem; the best approach was to multiply by 100 and represent the price as an integer (dividing by 100 only for printing/external output). This gives fast, exact computations (1 + 2 == 3, unlike 0.1 + 0.2 != 0.3).
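A minimal sketch of that idea (the DataFrame and column names are made up):

```python
import pandas as pd

# Hypothetical price column stored as two-decimal strings
df = pd.DataFrame({"price": ["133.04", "0.10", "0.20"]})

# Multiply by 100 and round before casting, so float noise
# (e.g. 0.10 * 100 == 10.000000000000002) cannot shift the integer
df["cents"] = (df["price"].astype(float) * 100).round().astype(int)

# Exact integer arithmetic: 10 + 20 == 30, unlike 0.1 + 0.2 != 0.3
total_cents = df["cents"].iloc[1] + df["cents"].iloc[2]

# Divide by 100 only for printing/external output
print(f"{total_cents / 100:.2f}")  # 0.30
```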


I have had this problem in the past as well, and the solution I ended up using was representing the currency as a multiple of its lowest denomination (i.e., one cent for USD). Thus, the type would be int. The advantage of this method is, as mentioned here already, that you can perform lossless integer computations.

Price (currency) = Multiplier * Sub_unit

E.g., for USD, the unit of the price would be the dollar and the subunit would be one cent, making the multiplier 100.

Another aspect worth mentioning is that this works well across different currencies. For instance, the smallest denomination of the yen is 1 yen, in which case the multiplier is 1. The Indonesian rupiah's smallest denomination is 1000 rupiah, so the multiplier can be 1 as well. You just need to keep track of the multiplier for each currency.

In fact, you could even write a custom class that wraps this conversion for you; that may be the most convenient solution.
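Such a wrapper class might look roughly like this (the `Money` name, the multiplier table, and the currencies shown are illustrative assumptions, not a real library):

```python
class Money:
    """Hypothetical wrapper storing an amount as an int count of sub-units."""

    # Sub-units per unit for each supported currency (assumed values)
    MULTIPLIERS = {"USD": 100, "JPY": 1}

    def __init__(self, sub_units, currency="USD"):
        self.sub_units = int(sub_units)
        self.currency = currency

    @classmethod
    def from_units(cls, amount, currency="USD"):
        # Round before casting so float noise cannot shift the integer
        m = cls.MULTIPLIERS[currency]
        return cls(round(float(amount) * m), currency)

    def __add__(self, other):
        # Only amounts in the same currency can be added directly
        assert self.currency == other.currency
        return Money(self.sub_units + other.sub_units, self.currency)

    def __str__(self):
        m = self.MULTIPLIERS[self.currency]
        return f"{self.sub_units / m:.2f} {self.currency}"


print(Money.from_units("1.10") + Money.from_units("2.20"))  # 3.30 USD
```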


You need to distinguish between the internal value representation and the way you present it (read more on MVC here). Since you stated that you do not need any other floating-point representation, I would recommend continuing to use a regular float for internal representation and math (it follows the IEEE 754 standard) and just adding this line

pd.options.display.float_format = '{:6.2f}'.format

at the beginning of your script. This will make all printed values automatically display rounded to two decimal places, without actually changing the underlying values. (pd is the common alias for pandas.)
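A quick demonstration of the effect (the DataFrame is made up):

```python
import pandas as pd

# Display floats rounded to two decimals; stored values are untouched
pd.options.display.float_format = '{:6.2f}'.format

df = pd.DataFrame({"price": [133.04, 0.1 + 0.2]})
print(df)  # shows 0.30, even though the stored value is 0.30000000000000004
```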


Decimal seems like a pretty reasonable representation for your use case. The underlying problem here is that the ohlc aggregator in pandas calls cython for speed and I assume that cython isn't able to take Decimals. See here: https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/groupby.py#L1203-L1212

Instead, I think the most straightforward way would be to simply write ohlc yourself so that it can operate on Decimals:

In [89]: index = pd.date_range('1/1/2000', periods=9, freq='T')

In [90]: series = pd.Series(np.linspace(0, 2, 9), index=index)

In [91]: series.resample('3T').ohlc()
Out[91]:
                     open  high   low  close
2000-01-01 00:00:00  0.00  0.50  0.00   0.50
2000-01-01 00:03:00  0.75  1.25  0.75   1.25
2000-01-01 00:06:00  1.50  2.00  1.50   2.00

In [92]: decimal_series = pd.Series([Decimal(x) for x in np.linspace(0, 2, 9)], index=index)

In [93]: def ohlc(x):
    ...:     x = x[x.notnull()]
    ...:     if x.empty:
    ...:         return pd.Series({'open': np.nan, 'high': np.nan, 'low': np.nan, 'close': np.nan})
    ...:     return pd.Series({'open': x.iloc[0], 'high': x.max(), 'low': x.min(), 'close':x.iloc[-1]})
    ...:
In [107]: decimal_series.resample('3T').apply(ohlc).unstack()
Out[107]:
                    close  high   low  open
2000-01-01 00:00:00   0.5   0.5     0     0
2000-01-01 00:03:00  1.25  1.25  0.75  0.75
2000-01-01 00:06:00     2     2   1.5   1.5