13

by = "B" block has duplicated indices both in case1 and case2,

why case1 work but case2 does not.

case1

df1 = pd.DataFrame({"a":[0,100,200],  "by":["A","B","B"]}, index=[0,1,1])
df1.groupby("by").diff()  
# result is okay

case2

df2 = pd.DataFrame({"a":[0,100,200],  "by":["C","B","B"]}, index=[0,1,1])
df2.groupby("by").diff()  
# throws ValueError: cannot reindex from a duplicate axis
Ch3steR
  • 20,090
  • 4
  • 28
  • 58
junliang
  • 131
  • 3
  • 1
    ```bug``` possibly? – sammywemmy Jun 19 '20 at 15:20
  • Using `sort=False` inside `groupby` does not trigger the error – Terry Jun 19 '20 at 15:27
  • 1
    if you pass ```sort=False``` it works fine. which leads me to think for the first df, groupby sorts alphabetically and the index is 0 followed by 1s - somehow the program does not see that as duplicates. However, the second df, it groups, sorts alphabetically, but B comes before C. the program sees two 1s, two Bs and triggers error. if you set ```sort=False``` it starts with C, index starts at 0, program does not see dupes and all is well. why? no idea. just musings about possible reason for the error. I'd suggest you file an issue though. – sammywemmy Jun 19 '20 at 15:27
  • 10
    This bug is fixed in `v.1.1.0`. It works fine in `v.1.1.0`, raises an error in `v.1.0.3` – Ch3steR Jun 19 '20 at 15:31
  • 1
    I made a PR to pandas recently so I have pandas master `v1.1.0` and ran both of them in `v.1.1.0` both worked fine and gave expected results. I couldn't find the PR or issue related to the above-mentioned issue, once I find it i'll leave the link here. – Ch3steR Jun 19 '20 at 15:33
  • good to know. thanks @Ch3steR. waiting patiently for ```v1.1.0``` release - primarily because of the included ability to handle nulls in groupby operations – sammywemmy Jun 19 '20 at 15:40
  • 1
    @sammywemmy Yes, `v1.1.0` is going to be a lot better. Check [`whats new in 1.1.0`](https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#contributors) and 1st bug fix under groupby/ resample/ rolling is the fix for the issue in the question. – Ch3steR Jun 19 '20 at 15:49
  • This is fixed [`GitHub issue 30667`](https://github.com/pandas-dev/pandas/issues/30667) and [`GitHub pullrequest 30679`](https://github.com/pandas-dev/pandas/pull/30679) – Ch3steR Jun 19 '20 at 15:56
  • Both codes have the same result for me, I am using v1.1.3 – Mehdi Golzadeh Oct 24 '20 at 17:37
  • I’m voting to close this question because it has been fixed in later version of pandas. – Scott Boston Apr 17 '21 at 14:07
  • Thus update your Pandas version like Scott Boston suggests or see my answer below and set your sort property to False – Laurent B. May 06 '21 at 01:30
  • try using pd.IntervalIndex to set index intervals see(https://www.w3resource.com/python-exercises/pandas/index/pandas-indexing-exercise-5.php#:~:text=IntervalIndex%20represents%20an%20Index%20of,closed%20on%20the%20same%20side.) – Golden Lion Apr 08 '21 at 14:44

3 Answers3

1

Your problem is solved by turning off the sort property of groupby.

df1 = pd.DataFrame({"a":[0,100,200],  "by":["C","B","B"]}, index=[0,1,1])
df1.groupby("by", sort=False).diff()
print(df1)

Result:

     a by
0    0  C
1  100  B
1  200  B

Explanation:

Even if you "cannot reindex from a duplicate axis", Pandas tries to do it by assigning a rank to letters by their alphabetical order when the sort property is activated, for instance :

A ---> 1

B ---> 2

B ---> 3

Even if we have 2 B's the incrementation is possible by considering the second B comes logically after the first B. For example the chunks of code below works perfectly:

import pandas as pd

# THE CODE BELOW WORKS PERFECTLY
df1 = pd.DataFrame({"a":[0,100,90],  "by":["A","B","B"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

df1 = pd.DataFrame({"a":[0,100,90],  "by":["B","C","C"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

df1 = pd.DataFrame({"a":[0,100,90],  "by":["C","D","D"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

Because D comes after C, C comes after B and so on... Pandas tries to find a logic

what is illogical considering the alphabetical order is this: DCC ---> You could not assign 1 to D thus 2 to C.

Chunks of code below generate errors:

# EVERY CHUNK OF CODE BELOW GENERATES AN ERROR
df1 = pd.DataFrame({"a":[0,100,90],  "by":["B","A","A"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

df1 = pd.DataFrame({"a":[0,100,90],  "by":["D","C","C"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

df1 = pd.DataFrame({"a":[0,100,90],  "by":["E","D","D"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

To go further: Let's consider these 2 chunks and their results:

df1 = pd.DataFrame({"a":[0,100,200],  "by":["E","D","F"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

with only a change on index...

df1 = pd.DataFrame({"a":[0,100,200],  "by":["E","D","F"]}, index=[0,1,2])
df1.groupby("by").diff()
print(df1)

#      a by
# 0    0  E
# 1  100  D
# 2  200  F

Even if EDF is not the alphabetical order, Pandas seems to go for a sort logic by using the index... index is 011 in the first case with no sorting logic that's not the case with 012

In conclusion you have to desactivate sorting by turning it to False property to prevent Pandas sorting attempts

Laurent B.
  • 1,653
  • 1
  • 7
  • 16
0

I assume .diff() processing sorted data. You can try this:

df2 = pd.DataFrame({"a":[0,100,200],  "by":["C","B","B"]}, index=[0,1,1])
df2.sort_values(by='by').groupby("by").diff()
Alihan ÖZ
  • 173
  • 1
  • 7
0

Pandas would not know how to perform when there are duplicate indices. In that case, one way we tell them is to reset index:

df2 = pd.DataFrame({"a":[0,100,200],  "by":["C","B","B"]}, index=[0,1,1])
df2.reset_index(drop=True).groupby("by").diff() 

Result:

     a
0   NaN
1   NaN
2   100.0

Only if you are certain that your series are aligned by position, you can call reset_index on each dataframe.

The AG
  • 672
  • 9
  • 18