0

Would you please tell me what is wrong with the following as I get the error:

ValueError: cannot reindex a non-unique index with a method or limit

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import pandas_datareader as web



data= web.get_data_yahoo("BTC-USD",
                        start = "2015-01-01 ",
                        end = "2021-01-01   ")



btc_dailly_return= data['Adj Close'].pct_change()
btc_monthly_returns = data['Adj Close'].resample('M').ffill().pct_change()
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
john
  • 27
  • 6

1 Answers1

2

When you use resample, you have to tell it how you would like to combine all the entries within the timeframe you chose. In your example, you're combining all the values within one month, you could combine them by adding them together, by taking the average, the standard devation, the maximum value, etc. So you have to tell Pandas what you would like to do by providing an additional method:

data['col'].resample('M').sum()
data['col'].resample('M').max()
data['col'].resample('M').mean()

In your case, last() is probably the most reasonable, so just change your last line to:

btc_monthly_returns = data['Adj Close'].resample('M').last().ffill().pct_change()

As to why the error only pops up with BTC-USD: that particular table has a duplicate date entry, causing ffill() to throw an error. last() (or any other reduction type aggregator) doesn't care about the duplicate.

Generally, resample('<method>').ffill() should be used for upsampling data, i.e. turning a list of months into a list of days. In that case ffill() would fill all the newly generated timestamps with the value from the previous valid timestamp. Your example downsamples, so a reducing aggregator like last, sum, or mean should be called.

pvandyken
  • 86
  • 4
  • Thanks. However, I wonder why this works as well: GOOG_monthly_returns = GOOG['Adj Close'].resample('M').ffill().pct_change() – john Sep 23 '21 at 20:46
  • Looked a bit more into it, first, sum is not at all what you want, sorry about that. It should be last if you want the change each month. As to why GOOG works, seems to be because the btc data has some duplicate dates (try running `any(pd.Index.duplicated(data.index))`. `ffill()` apparently can't handle that, I'm not totally sure why to be honest. I'd have to dig more into where the duplicates are – pvandyken Sep 23 '21 at 22:09
  • Still not completely sure why duplicate data is making `ffill()` through an error here (I've tried it in other contexts and it works fine). More generally, though, `ffill()` should be used when upsampling, i.e. adding in more timepoints than before. `ffill()` is used to fill in the gaps. In your case, you're downsampling, making `ffill()` inappropriate as the first aggregation function. I'll update the answer to include more of this – pvandyken Sep 23 '21 at 23:08