1

I'm manually passing specific values from a pandas df to a function. This works, but I'm hoping to make the process more efficient. Specifically, I first subset each run of consecutive values in Item. I then take the respective values in Val and pass them to func. This produces the value I need. This is OK for smaller dfs but becomes inefficient for larger datasets.

I'm hoping to find a more efficient way to apply the function and assign the resulting values back to the original df.

import pandas as pd
import numpy as np

df = pd.DataFrame({ 
            'Time' : ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'],                   
            'Val' : [35,38,31,30,35,31,32,34,36,38,39,30,25,26,27],                   
            'Item' : ['X','X','X','X','X','Y','Y','Y','Y','Y','Y','X','X','X','X'],  
                    })

df1 = df.groupby([df['Item'].ne(df['Item'].shift()).cumsum(), 'Item']).size()

X1 = df[0:5]
Y1 = df[5:11]
X2 = df[11:15]

V1 = X1['Val'].reset_index(drop = True)
V2 = Y1['Val'].reset_index(drop = True)
V3 = X2['Val'].reset_index(drop = True)

def func(U, m = 2, r = 0.2):

        def _maxdist(x_i, x_j):
            return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

        def _phi(m):
            x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
            C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
            return (N - m + 1.0)**(-1) * sum(np.log(C))

        N = len(U)

        return abs(_phi(m + 1) - _phi(m))

print(func(V1))
print(func(V2))
print(func(V3))

out:

0.287682072452
0.223143551314
0.405465108108

If I just try to apply the function using groupby it returns KeyError: 0. The function doesn't work unless I reset the index.

df1 = df.groupby(['Item']).apply(func)

KeyError: 0

Intended Output:

   Time  Val Item   func
0     1   35    X  0.287
1     2   38    X  0.287
2     3   31    X  0.287
3     4   30    X  0.287
4     5   35    X  0.287
5     6   31    Y  0.223
6     7   32    Y  0.223
7     8   34    Y  0.223
8     9   36    Y  0.223
9    10   38    Y  0.223
10   11   39    Y  0.223
11   12   30    X  0.405
12   13   25    X  0.405
13   14   26    X  0.405
14   15   27    X  0.405
jonboy

3 Answers

2

The issue is U[j] in the _phi function. There, j is a positional index, but [] on an integer-indexed Series looks up labels, so any group whose index does not start at 0 fails. You can use U.iloc[j], or convert U to a list and work directly on the list. Working on a list appears to be faster than using iloc, so my fix converts U to a list. The line x = ... in _phi can also be shortened.
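To illustrate the label-versus-position point, here is a minimal sketch. The Series below is made up for the example (it only mimics the labels the second group would carry), so treat it as an assumption rather than the asker's exact data.

```python
import pandas as pd

# A Series whose labels start at 5, like the Val slice of the second group.
U = pd.Series([31, 32, 34], index=[5, 6, 7])

try:
    U[0]                       # label lookup: no label 0 exists here
    raised = False
except KeyError:
    raised = True
print(raised)                  # True

by_position = U.iloc[0]        # positional lookup
from_list = U.tolist()[0]      # a plain list is always positional
print(by_position, from_list)  # 31 31
```

This is also why resetting the index made the manual version work: after reset_index(drop=True) the labels and the positions coincide.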

Method 1:

def func(U, m = 2, r = 0.2):

    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        vals = U.tolist()  # convert once so the slices below are purely positional
        x = [vals[i:i + m] for i in range(N - m + 1)] #change at this line
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))

Create a custom group ID s as you did, group by s, and call transform:

s = df['Item'].ne(df['Item'].shift()).cumsum()
df['func'] = df.groupby(s).Val.transform(func)

Out[1090]:
   Time  Val Item      func
0     1   35    X  0.287682
1     2   38    X  0.287682
2     3   31    X  0.287682
3     4   30    X  0.287682
4     5   35    X  0.287682
5     6   31    Y  0.223144
6     7   32    Y  0.223144
7     8   34    Y  0.223144
8     9   36    Y  0.223144
9    10   38    Y  0.223144
10   11   39    Y  0.223144
11   12   30    X  0.405465
12   13   25    X  0.405465
13   14   26    X  0.405465
14   15   27    X  0.405465
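For readers unfamiliar with the ne/shift/cumsum idiom, the sketch below shows what the group ID looks like on a small made-up column, and how transform broadcasts one value per group back to every row. The data and the use of "max" as a stand-in for func are assumptions chosen to keep the example short.

```python
import pandas as pd

item = pd.Series(list("XXYYYX"), name="Item")
val = pd.Series([35, 38, 31, 32, 34, 30], name="Val")

# True wherever the value differs from the previous row (the first row always does).
starts = item.ne(item.shift())
# The running total of those flags gives one integer label per consecutive run.
run_id = starts.cumsum()
print(run_id.tolist())   # [1, 1, 2, 2, 2, 3]

# transform repeats each group's scalar result on every row of that group.
run_max = val.groupby(run_id).transform("max")
print(run_max.tolist())  # [38, 38, 34, 34, 34, 30]
```

Note that the second run of X gets its own label (3), which plain grouping by Item would not give you.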

Method 2: This is shorter but less readable. It uses as_strided from numpy.lib.stride_tricks.

def func(U, m = 2, r = 0.2):

    def _phi(m):
        strd = U.to_numpy().strides[0]
        x = as_strided(U.to_numpy(), (N-m+1, m), (strd, strd))
        C = (np.abs(x - x[:,None]).max(-1) <= r).sum(-1) / (N - m + 1.0)    
        return np.sum(np.log(C)) / (N - m + 1.0)

    N = len(U)

    return abs(_phi(m + 1) - _phi(m))      

You need to import as_strided, then create the group ID and call groupby transform as in Method 1:

from numpy.lib.stride_tricks import as_strided

s = df['Item'].ne(df['Item'].shift()).cumsum()
df['func'] = df.groupby(s).Val.transform(func)
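If the stride arithmetic in Method 2 is hard to follow, the sketch below shows the same windowing idea with sliding_window_view, a different helper from the same module. This is a swapped-in alternative, not what the method above uses, and it assumes NumPy 1.20 or later. The values are the first run from the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

values = np.array([35, 38, 31, 30, 35])
m = 2

# Row i holds values[i:i + m]; the rows are views, not copies.
windows = sliding_window_view(values, m)
print(windows.tolist())  # [[35, 38], [38, 31], [31, 30], [30, 35]]

# Max absolute difference between every pair of windows, via broadcasting.
dist = np.abs(windows[:, None, :] - windows[None, :, :]).max(axis=-1)
print(dist.shape)        # (4, 4)

# Number of windows within r of each window (each one matches at least itself).
counts = (dist <= 0.2).sum(axis=-1)
print(counts.tolist())   # [1, 1, 1, 1]
```

The pairwise array has (N - m + 1) squared rows times m entries, so memory grows quadratically with the length of a run; for very long runs the list-based Method 1 may be the safer choice.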
Andy L.
1

It seems that you are using apply with func as is, but func is not prepared to receive a whole slice of the dataframe directly. In these cases, lambda expressions are useful.

You could do as follows:

# First, convert each item (string) to a unique value (integer) (based on solution here: https://stackoverflow.com/questions/31701991/string-of-text-to-unique-integer-method)
df['ItemID'] = df['Item'].apply(lambda s: int.from_bytes(s.encode(), 'little'))

# Get the consecutive items (based on solution here: https://stackoverflow.com/questions/26911851/how-to-use-pandas-to-find-consecutive-same-data-in-time-series)
ItemConsecutive = (np.diff(df['ItemID'].values) != 0).astype(int).cumsum()
ItemConsecutive = np.insert(ItemConsecutive, 0, 0)  # the first row always starts run 0, even if row 2 differs
df['ItemConsecutive'] = ItemConsecutive

# Define your custom func (unmodified)
def func(U, m = 2, r = 0.2):
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])
    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))
    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

# Get your calculated values with func based on each consecutive item
func_values = df.groupby('ItemConsecutive').apply(lambda x: func(x['Val'].reset_index(drop=True)))
func_values.name = 'func'

# Complete the dataframe with your calculated values
df = df.join(func_values, on='ItemConsecutive')

This is the result:

   Item Time  Val  ItemID  ItemConsecutive      func
0     X    1   35      88                0  0.287682
1     X    2   38      88                0  0.287682
2     X    3   31      88                0  0.287682
3     X    4   30      88                0  0.287682
4     X    5   35      88                0  0.287682
5     Y    6   31      89                1  0.223144
6     Y    7   32      89                1  0.223144
7     Y    8   34      89                1  0.223144
8     Y    9   36      89                1  0.223144
9     Y   10   38      89                1  0.223144
10    Y   11   39      89                1  0.223144
11    X   12   30      88                2  0.405465
12    X   13   25      88                2  0.405465
13    X   14   26      88                2  0.405465
14    X   15   27      88                2  0.405465

BTW, I'm using pandas version 0.23.3

alan.elkin
0

You need to use apply after the groupby: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

df1 = df.groupby(['Item']).apply( lambda x : myfunc(x) )

myfunc operates on sub-dataframes which are grouped by 'Item'.
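One caveat worth checking before relying on this: grouping by 'Item' alone puts every row with the same item into one group, whether or not the rows are consecutive. The sketch below uses a small made-up frame (an assumption, not the asker's data) to contrast that with grouping by a consecutive-run label.

```python
import pandas as pd

df = pd.DataFrame({
    "Val": [35, 38, 31, 32, 30, 25],
    "Item": ["X", "X", "Y", "Y", "X", "X"],
})

# Grouping by the raw column: both X runs end up in the same group.
by_item = df.groupby("Item")["Val"].apply(list)
print(by_item.to_dict())  # {'X': [35, 38, 30, 25], 'Y': [31, 32]}

# Grouping by a run label keeps the two X runs apart.
run_id = df["Item"].ne(df["Item"].shift()).cumsum()
by_run = df.groupby(run_id)["Val"].apply(list)
print(by_run.tolist())    # [[35, 38], [31, 32], [30, 25]]
```

For the intended output in the question, where the two X runs get different values, the run-label grouping is the one that matches.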

tensor