4

Below you find the code I wrote to calculate a relative change in value of df.a and df.b while df is a dataframe. What has to be calculated is basically df["c"] = df.a/df.a.iloc[df.d].values. df.d is set equal to df.t if df.a/df.a.iloc[df.d].values is bigger or smaller than df.b/df.b.iloc[df.d].values * (1+ tolerance)

The problem is that the code currently brings the following error code: ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index 2011-01-01 00:00:00')and I absolutely don't know why...

import pandas as pd
import numpy as np
import datetime

randn = np.random.randn
rng = pd.date_range('1/1/2011', periods=10, freq='D')

df = pd.DataFrame({'a': [1.1, 1.2, 2.3, 1.4, 1.5, 1.8, 0.7, 1.8, 1.9, 2.0], 'b': [1.1, 1.5, 1.3, 1.6, 1.5, 1.1, 1.5, 1.7, 2.1, 2.1],'c':[None] * 10},index=rng)

df["d"]= [0,0,0,0,0,0,0,0,0,0]
df["t"]= np.arange(len(df))
tolerance = 0.3

def set_t(x):
    if df.a/df.a.iloc[df.d].values < df.b/df.b.iloc[df.d].values * (1+tolerance):
        return  df.iloc[df.index.get_loc(x.name) - 1]['d'] == df.t
    elif df.a/df.a.iloc[df.d].values > df.b/df.b.iloc[df.d].values * (1+tolerance):
        return df.iloc[df.index.get_loc(x.name) - 1]['d'] == df.t

#The conditions in part one are exactly the same as in part 2, only first it says smaller than, and in the second part is bigger than df.b/df.b.iloc[df.d].values * (1+tolerance)


df['d'] = df.apply(set_t, axis =1)

#df["d"]= [0,0,0,3,3,3,6,7,7,7] this should be the coutcome for d

df["c"] = df.a/df.a.iloc[df.d].values 

The application of (df.a/df.a.iloc[df.d].values).all() < (df.b/df.b.iloc[df.d].values).all() or .any() does not lead to the deired outcome, since it only checks when the data currently set is TRUE or FALSE, but it does not set the new value.

The desired outcome looks like this:

              a    b         c  d  t
2011-01-01  1.1  1.1  1.000000  0  0
2011-01-02  1.2  1.5  1.090909  0  1
2011-01-03  2.3  1.3  2.090909  0  2
2011-01-04  1.4  1.6  1.000000  3  3
2011-01-05  1.5  1.5  1.071429  3  4
2011-01-06  1.8  1.1  1.285714  3  5
2011-01-07  0.7  1.5  1.000000  6  6
2011-01-08  1.8  1.7  1.000000  7  7
2011-01-09  1.9  2.1  1.055556  7  8
2011-01-10  2.0  2.1  1.111111  7  9

Any ideas how to solve that?

hb.klein
  • 381
  • 1
  • 4
  • 16
  • 2
    This condition `df.a/df.a.iloc[df.d].values < df.b/df.b.iloc[df.d].values * (1+tolerance)` returns an array, not a single value. You can not do `if [True, True, False]:` (or anything of sort). You have to provide python with a way to convert that array into a single value. That's usually done with `all` or `any`. – Korem Jun 06 '15 at 14:05
  • Your condition returns a `pandas.core.series.Series` so what should `if pandas.core.series.Series` evaluate to? – Padraic Cunningham Jun 06 '15 at 14:08
  • @Padraic Cunningham: How would I apply this? – hb.klein Jun 06 '15 at 14:20
  • This isn't your core problem, but you want to use iloc like this: `iloc[x,y]` **not** like this: `iloc[x][y]` which largely defeats the point of iloc. – JohnE Jun 06 '15 at 15:09
  • Your conditions aren't quite clear. eg. in the function set_t you seem to be returning the same values. I believe if values in 'a' are less than values in 'b' * 1.3 then return something else return something right? But what is that something that you want to return? – gabhijit Jun 06 '15 at 15:33
  • @ gabhijit: the conditions are the same, only one time its bigger, the second time it's smaller than `df.b/df.b.iloc[df.d].values * (1+tolerance)` - so you could also write bigger or smaller than - but how do I do that in one line? – hb.klein Jun 06 '15 at 19:19

2 Answers2

2

This is not a 100% solution, but should at least get you down a better path and fix a primary problem. The core problem I'm seeing here from a syntax side is that you are trying to mix vectorized and non-vectorized code. You could instead do something more like this:

>>> df['d1'] = df.a/df.a.iloc[df.d].values > df.b/df.b.iloc[df.d].values * (1+tolerance)

>>> df['d2'] = df.a/df.a.iloc[df.d].values * (1+tolerance) < df.b/df.b.iloc[df.d].values

>>> df['d'] = df['d1'] | df['d2']

>>> df

              a    b     c      d  t     d1     d2
2011-01-01  1.1  1.1  None  False  0  False  False
2011-01-02  1.2  1.5  None  False  1  False  False
2011-01-03  2.3  1.3  None   True  2   True  False
2011-01-04  1.4  1.6  None  False  3  False  False
2011-01-05  1.5  1.5  None  False  4  False  False
2011-01-06  1.8  1.1  None   True  5   True  False
2011-01-07  0.7  1.5  None   True  6  False   True
2011-01-08  1.8  1.7  None  False  7  False  False
2011-01-09  1.9  2.1  None  False  8  False  False
2011-01-10  2.0  2.1  None  False  9  False  False

That's not quite the answer you want, but hopefully shows you what is going on with the code and how you can fix it to get what you want (i.e. you don't need or want to be using a function and applying it here, just use standard pandas vectorized code).

If you can get that to work, the cleaner way to do that would be with np.where (either two of them sequentially, or nested).

JohnE
  • 29,156
  • 8
  • 79
  • 109
  • That's it! I would only need the values instead of `True` or `False` in one row – hb.klein Jun 06 '15 at 19:48
  • we have here level 2 of this question, even with a bounty: http://stackoverflow.com/questions/30745160/portfolio-rebalancing-with-bandwidth-method-in-python – hb.klein Jun 12 '15 at 06:46
2

Ok, I get the result you want, but this is still way too complicated and unefficient. I would be interested to see a superior solution:

import pandas as pd
import numpy as np
import datetime

randn = np.random.randn


rng = pd.date_range('1/1/2011', periods=10, freq='D')

df = pd.DataFrame({'a': [1.1, 1.2, 2.3, 1.4, 1.5, 1.8, 0.7, 1.8, 1.9, 2.0], 'b': [1.1, 1.5, 1.3, 1.6, 1.5, 1.1, 1.5, 1.7, 2.1, 2.1],'c':[None] * 10},index=rng)



df["d"]= [0,0,0,0,0,0,0,0,0,0]



df["t"]= np.arange(len(df))
tolerance = 0.3

df['d1'] = df.a/df.a.iloc[df.d].values > df.b/df.b.iloc[df.d].values * (1+tolerance)

df['d2'] = df.a/df.a.iloc[df.d].values * (1+tolerance) < df.b/df.b.iloc[df.d].values



df['e'] = df.d1*df.t
df['f'] = df.d2*df.t
df['g'] = df.e +df.f
df.ix[df.g > df.g.shift(1),"h"] = df.g * 1; df
df.h = df.h + 1
df.h = df.h.shift(1)
df['h'][0] = 0

df.h.fillna(method='ffill',inplace=True)
df["d"] = df.h
df["c"] = df.a/df.a.iloc[df.d].values

and that's the result:

              a    b         c  d  t     d1     d2  e  f  g  h
2011-01-01  1.1  1.1  1.000000  0  0  False  False  0  0  0  0
2011-01-02  1.2  1.5  1.090909  0  1  False  False  0  0  0  0
2011-01-03  2.3  1.3  2.090909  0  2   True  False  2  0  2  0
2011-01-04  1.4  1.6  1.000000  3  3  False  False  0  0  0  3
2011-01-05  1.5  1.5  1.071429  3  4  False  False  0  0  0  3
2011-01-06  1.8  1.1  1.285714  3  5   True  False  5  0  5  3
2011-01-07  0.7  1.5  1.000000  6  6  False   True  0  6  6  6
2011-01-08  1.8  1.7  1.000000  7  7  False  False  0  0  0  7
2011-01-09  1.9  2.1  1.055556  7  8  False  False  0  0  0  7
2011-01-10  2.0  2.1  1.111111  7  9  False  False  0  0  0  7

from here you can easily delete rows with e.g. del df['g']

sorownas
  • 157
  • 8
  • This solution only works for this example. df.d has to be calculated "on the fly and has to be sett as soon as df.d1 or df.d2 occurs to be true, since e.g. df.d1= True could occur several times in a row which would each time to a resetting of df.d = df.t – hb.klein Jun 07 '15 at 03:32