Different uses of diff() yield different results. Why, and what do they signify?

Question

Using the diff() function in two different ways yields different results in my analysis. It isn't clear to me why that happens or what the results signify. Can someone help?

I am analyzing data from my behavioral experiment in a Jupyter notebook. For every participant I have trial-wise data on the number of apple, rice and teak they have sown and harvested. I am trying to smooth and normalize the 'teak-share' (it's a farming simulation game, and 'teak-share' is the difference between teak sown and teak harvested in each trial) and then find its difference between trials. However, using diff() in two different ways yields two different results. Why is that?

Scenario 1:

# Working out the correlation for participant Parika
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

name = 'Parika'
fname = name + '.xlsx'
data = pd.read_excel(fname)
data.columns = data.columns.str.rstrip()

# Cumulative (sown - reaped) per crop
data['apple-share'] = np.cumsum(data['Apples sown'].values - data['Apples reaped'].values)
data['rice-share'] = np.cumsum(data['Rice sown'].values - data['Rice reaped'].values)
data['teak-share'] = np.cumsum(data['Teak sown'].values - data['Teak reaped'].values)

# Smooth (Parzen window), normalize, then take the trial-to-trial difference
df = ((data['teak-share'].rolling(window=25, min_periods=1, win_type='parzen', center=True).mean() - data['teak-share'][24:].mean()) / data['teak-share'][24:].std()).diff()
df.plot(kind="line")

# Mark the trials with a budget overrun
overrun = data[data['Resource Cost'] > 5000]['Simulation No'].values
for x in overrun:
    plt.axvline(x, color='red', linestyle=':', linewidth=2)
plt.xticks(np.arange(0, 120, step=24), (data['Block'][0], data['Block'][24][0], data['Block'][48][0], data['Block'][72][0], data['Block'][96][0]))

# Sum the differences over the 5 trials starting at each overrun
cumdev = 0
for n in range(5):
    cumdev = cumdev + df[overrun + n].sum()

print(cumdev)
plt.title("Smoothed")
plt.ylabel("Teak share")
plt.xlabel("Trials")
plt.show()

Here, df is calculated by smoothing, then normalizing, and then taking diff(). This yields the following plot: [plot of teak share]

Scenario 2:

# Working out the correlation for participant Parika
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

name = 'Parika'
fname = name + '.xlsx'
data = pd.read_excel(fname)
data.columns = data.columns.str.rstrip()

# Cumulative (sown - reaped) per crop
data['apple-share'] = np.cumsum(data['Apples sown'].values - data['Apples reaped'].values)
data['rice-share'] = np.cumsum(data['Rice sown'].values - data['Rice reaped'].values)
data['teak-share'] = np.cumsum(data['Teak sown'].values - data['Teak reaped'].values)

# Smooth (Parzen window) and normalize; no diff() here
df = ((data['teak-share'].rolling(window=25, min_periods=1, win_type='parzen', center=True).mean() - data['teak-share'][24:].mean()) / data['teak-share'][24:].std())
df.plot(kind="line")

# Mark the trials with a budget overrun
overrun = data[data['Resource Cost'] > 5000]['Simulation No'].values
for x in overrun:
    plt.axvline(x, color='red', linestyle=':', linewidth=2)
plt.xticks(np.arange(0, 120, step=24), (data['Block'][0], data['Block'][24][0], data['Block'][48][0], data['Block'][72][0], data['Block'][96][0]))

# diff() is applied only here, when summing over the 5 trials starting at each overrun
cumdev = 0
for n in range(5):
    cumdev = cumdev + df.diff()[overrun + n].sum()

print(cumdev)
plt.title("Smoothed")
plt.ylabel("Teak share")
plt.xlabel("Trials")
plt.show()

Here, df is calculated the same way as above but without diff(). The diff() is applied only when calculating cumdev. This yields the following plot: [plot of teak share]

The red lines mark the trials where the participant faces a budgetary overrun. Even though cumdev comes out the same in both cases, the plots are different. It is not clear to me why that would be the case. Please help.
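As background on what diff() signifies, here is a minimal sketch on a made-up four-value series (the numbers are illustrative, not from the experiment). `Series.diff()` returns, for each position, the value minus the previous value, so the first entry is NaN and the rest are trial-to-trial changes.

```python
import pandas as pd

# Hypothetical toy series, standing in for a smoothed, normalized share
s = pd.Series([1.0, 4.0, 9.0, 16.0])

# Each entry is s[i] - s[i-1]; the first entry has no predecessor, so it is NaN
d = s.diff()
print(d.tolist())  # [nan, 3.0, 5.0, 7.0]

# Summing consecutive differences telescopes to last - first (NaN is skipped by sum)
total = d.sum()
print(total)  # 15.0
```

So a plot of the differenced series shows how much the value changes from one trial to the next, while a plot of the undifferenced series shows the level itself.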

I see you only moved the `diff()` function to a separate assignment statement in Scenario 2. In that case, the final `df` results should be equal, as you are just applying that function to the same Pandas series in different places. Can you provide more information about the differences you are seeing and ideally the `teak-share` data you created on the first line in both scenarios? — AlexK, Mar 30 '19 at 07:06
Please update the question to conform with these SO guidelines: [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) — AlexK, Mar 30 '19 at 07:27
Also it might interest you to have a look at [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — anky, Mar 30 '19 at 13:08
I see you edited the question. In the future, please add a comment that you did that. And in your second scenario, the `diff()` function does not play a role at all in making this line plot of teak share. In the second scenario, you are just using it to calculate `cumdev`, but `cumdev` does not relate to the plot. The plot is produced with the `df.plot()` command earlier in your code. So that's the difference between plots: in the 1st scenario you are plotting something after applying `diff()`, but in the 2nd scenario you are plotting it without applying `diff()`. — AlexK, Mar 30 '19 at 19:52
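
The point made in the last comment can be checked with a small sketch on synthetic data. Everything here is an assumption for illustration: the random series stands in for 'teak-share', the overrun trials are made up, and a plain rolling mean replaces the Parzen window so that SciPy is not needed. The series that gets plotted differs between the two scenarios, while cumdev is the same.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the participant's cumulative teak share (hypothetical values)
rng = np.random.default_rng(0)
teak_share = pd.Series(rng.integers(-5, 6, size=120)).cumsum()

# Smooth and normalize (plain rolling mean instead of the Parzen window)
smoothed = teak_share.rolling(window=25, min_periods=1, center=True).mean()
normalized = (smoothed - teak_share.iloc[24:].mean()) / teak_share.iloc[24:].std()

# Scenario 1: diff() is part of the assignment, so the differenced series is plotted
df1 = normalized.diff()
# Scenario 2: no diff() in the assignment, so the undifferenced series is plotted
df2 = normalized

# Hypothetical trial numbers standing in for the budget overruns
overrun_trials = np.array([30, 60, 90])

# cumdev uses the differenced values in both scenarios
cumdev1 = sum(df1.loc[overrun_trials + n].sum() for n in range(5))
cumdev2 = sum(df2.diff().loc[overrun_trials + n].sum() for n in range(5))

same_cumdev = bool(np.isclose(cumdev1, cumdev2))
same_plot_data = df1.equals(df2)
print(same_cumdev, same_plot_data)  # True False
```

In other words, `df.plot()` draws whatever `df` holds at that moment. Applying diff() later, inside the cumdev loop, creates a temporary series and does not change `df` or the figure.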

