Different uses of diff() yield different results in my analysis. Why is that, and what do the results signify? Some help, please?
I am analyzing a behavioral experiment in a Jupyter notebook. For every participant I have trial-wise data on the number of apple, rice and teak units they have sown and harvested. I am trying to smooth and normalize the 'teak-share' (it's a farming simulation game, and 'teak-share' is the difference between teak sown and teak harvested in each trial, accumulated over trials) and then find its difference between trials. However, using diff() in two different ways yields two different results. Why is that?
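To make the definition concrete, here is a minimal sketch on a tiny made-up table. The numbers and the name `toy` are invented for illustration; only the column names match my spreadsheet. It shows how the share column is built as a running total and what diff() returns for it.

```python
import pandas as pd

# Made-up values for five trials (the real data is read from the Excel file).
toy = pd.DataFrame({
    "Teak sown":   [3, 2, 0, 4, 1],
    "Teak reaped": [0, 1, 2, 1, 3],
})

# Per-trial net planting: sown minus reaped.
net = toy["Teak sown"] - toy["Teak reaped"]   # 3, 1, -2, 3, -2

# 'teak-share' is the running total of the per-trial net values.
toy["teak-share"] = net.cumsum()              # 3, 4, 2, 5, 3

# diff() subtracts the previous row from each row, so on the running total
# it gives back the per-trial net values; the first entry is NaN because
# it has no predecessor.
recovered = toy["teak-share"].diff()          # NaN, 1, -2, 3, -2
print(recovered)
```

So diff() describes the trial-to-trial change, while the undifferenced column describes the level.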
Scenario 1:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Working out correlation for participant Parika
name = 'Parika'
fname = name + '.xlsx'
data = pd.read_excel(fname)
data.columns = data.columns.str.rstrip()
# Running totals of (sown - reaped) for each crop
data['apple-share'] = np.cumsum(data['Apples sown'].values - data['Apples reaped'].values)
data['rice-share'] = np.cumsum(data['Rice sown'].values - data['Rice reaped'].values)
data['teak-share'] = np.cumsum(data['Teak sown'].values - data['Teak reaped'].values)
df = ((data['teak-share'].rolling(window=25, min_periods = 1, win_type='parzen', center=True).mean() - data['teak-share'][24:].mean())/data['teak-share'][24:].std()).diff()
df.plot(kind="line")
for x in data[data['Resource Cost']>5000]['Simulation No'].values:
    plt.axvline(x, color='red', linestyle=':', linewidth=2)
plt.xticks(np.arange(0,120, step= 24), (data['Block'][0], data['Block'][24][0], data['Block'][48][0], data['Block'][72][0], data['Block'][96][0]))
N = range(5)
cumdev = 0
for n in N:
    cumdev = cumdev + df[data[data['Resource Cost']>5000]['Simulation No'].values + n].sum()
print(cumdev)
plt.title("Smoothed")
plt.ylabel("Teak share")
plt.xlabel("Trials")
plt.show()
Here, df is calculated by smoothing, then normalizing, and then taking diff(). This yields: [plot of teak share]
Scenario 2:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Working out correlation for participant Parika
name = 'Parika'
fname = name + '.xlsx'
data = pd.read_excel(fname)
data.columns = data.columns.str.rstrip()
# Running totals of (sown - reaped) for each crop
data['apple-share'] = np.cumsum(data['Apples sown'].values - data['Apples reaped'].values)
data['rice-share'] = np.cumsum(data['Rice sown'].values - data['Rice reaped'].values)
data['teak-share'] = np.cumsum(data['Teak sown'].values - data['Teak reaped'].values)
df = ((data['teak-share'].rolling(window=25, min_periods = 1, win_type='parzen', center=True).mean() - data['teak-share'][24:].mean())/data['teak-share'][24:].std())
df.plot(kind="line")
for x in data[data['Resource Cost']>5000]['Simulation No'].values:
    plt.axvline(x, color='red', linestyle=':', linewidth=2)
plt.xticks(np.arange(0,120, step= 24), (data['Block'][0], data['Block'][24][0], data['Block'][48][0], data['Block'][72][0], data['Block'][96][0]))
N = range(5)
cumdev = 0
for n in N:
    cumdev = cumdev + df.diff()[data[data['Resource Cost']>5000]['Simulation No'].values + n].sum()
print(cumdev)
plt.title("Smoothed")
plt.ylabel("Teak share")
plt.xlabel("Trials")
plt.show()
Here, df is calculated the same way as above but without diff(); diff() is applied only when calculating cumdev. This yields: [plot of teak share]
The red lines mark the trials where the participant faces a budgetary overrun. Even though cumdev comes out the same in both cases, the plots are different. It is not clear to me why that would be the case. Please help.
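For anyone who wants to reproduce this without my Excel file, here is a minimal sketch on synthetic data. The random series, the `events` trial numbers and all variable names are invented, and I use a plain rolling mean instead of the Parzen window so the snippet does not need SciPy. It applies diff() at the same two points as the scenarios above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for 'teak-share': a random walk over 120 trials.
rng = np.random.default_rng(0)
share = pd.Series(rng.integers(-3, 4, size=120).cumsum())

# Smooth, then normalize by the mean and std of trials 24 onward
# (plain rolling mean here; the real code passes win_type='parzen').
smoothed = share.rolling(window=25, min_periods=1, center=True).mean()
level = (smoothed - share.iloc[24:].mean()) / share.iloc[24:].std()

events = np.array([30, 60, 90])   # hypothetical overrun trials
offsets = range(5)

# Scenario 1: diff() is applied before df is stored (and plotted).
df1 = level.diff()
cumdev1 = sum(df1.loc[events + n].sum() for n in offsets)

# Scenario 2: df keeps the undifferenced series; diff() is applied
# only inside the cumdev calculation.
df2 = level
cumdev2 = sum(df2.diff().loc[events + n].sum() for n in offsets)

print(cumdev1, cumdev2)           # same number twice
print(df1.equals(df2))            # False: the stored series differ
```

In this sketch `cumdev1` and `cumdev2` agree because both sum the same differenced values, while `df1` holds trial-to-trial changes and `df2` holds the normalized level, so `df1.plot()` and `df2.plot()` would draw different curves.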