Interesting results with duplicate columns in pandas.DataFrame

Question

Can anyone help explain why some operations raise errors and others do not when a pandas.DataFrame contains a duplicate column name?

Minimal, Reproducible Example

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'b'])

If I try to insert a list into column 'a', I get an error about a dimension mismatch:

df.loc[:, 'a'] = list(range(5))

Traceback (most recent call last):
...
ValueError: cannot copy sequence with size 5 to array axis with dimension 0

Similar with 'b':

df.loc[:, 'b'] = list(range(5))

Traceback (most recent call last):
...
ValueError: could not broadcast input array from shape (5) into shape (0,2)
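
The `(0,2)` in this message is consistent with the label 'b' selecting two columns at once, while 'a' selects only one. A minimal sketch to check what each label selects (the names `sel_a` and `sel_b` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'b'])

sel_a = df.loc[:, 'a']   # unique label: a single column comes back as a Series
sel_b = df.loc[:, 'b']   # duplicated label: both 'b' columns come back as a DataFrame

print(type(sel_a).__name__, sel_a.shape)
print(type(sel_b).__name__, sel_b.shape)
```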

However, if I insert into an entirely new column I don't get an error; inserting into 'a' or 'b' afterwards still fails:

df.loc[:, 'c'] = list(range(5))
print(df)

     a    b    b  c
0  NaN  NaN  NaN  0
1  NaN  NaN  NaN  1
2  NaN  NaN  NaN  2
3  NaN  NaN  NaN  3
4  NaN  NaN  NaN  4

df.loc[:, 'a'] = list(range(5))

Traceback (most recent call last):
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

All of these errors disappear if I remove the duplicate column 'b'.
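
For reference, one way to detect and drop the repeated label, as a sketch only (`dup_mask` and `deduped` are illustrative names; this keeps the first 'b' and discards the second, which may not be what you want with real data):

```python
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'b'])

print(df.columns.is_unique)   # False while 'b' appears twice

# True for every repeat of a label that has already been seen
dup_mask = df.columns.duplicated()

# Keep only the first occurrence of each label
deduped = df.loc[:, ~dup_mask]
print(list(deduped.columns))
```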

Additional information

pandas==1.0.2

It is the duplicate column name - see it asked here: https://stackoverflow.com/questions/27065133/pandas-merge-giving-error-buffer-has-wrong-number-of-dimensions-expected-1-go — Anna Semjén, Dec 11 '20 at 16:43
Yes, I can see it is caused by the duplicate. However, I am curious as to why I can carry out `df.loc[:, 'c'] = list(range(5))`. I'll rephrase my question. — AmyChodorowski, Dec 11 '20 at 16:48
My *guess* is that creating a new column first creates a Series, then joins it to the dataframe. Assigning to existing columns attempts to put the values in the pre-allocated positions. — Paul H, Dec 11 '20 at 16:53
I think it is hard to tell without debugging the package's code itself. If you look at your example in `managers.py`, `blknos` can't get configured correctly because it is an int instead of the array it would be if your column names were unique. The `get_blkno_placements` function expects an array but an int is passed, so the function can't run. That said, I ran this in version 1.1.2 and couldn't reproduce the issue. — Anna Semjén, Dec 11 '20 at 17:16
I really asked out of curiosity, to help better understand what is going on under the hood, so I can use `pandas.DataFrame` better. Obviously, it is never advisable to have duplicate columns. — AmyChodorowski, Dec 11 '20 at 17:32
@AmyChodorowski - How do `df['a'] = list(range(5))` and `df['b'] = list(range(1,6))` work? — jezrael, Dec 15 '20 at 11:14
@jezrael interestingly both `df['a']` and `df.a` work fine. — Leonardus Chen, Dec 16 '20 at 11:45
This issue still exists in 1.3.0, although the error message is slightly different: `ValueError: cannot copy sequence with size 5 to array axis with dimension 0`. I guess I'll open an issue for this. — Leonardus Chen, Dec 16 '20 at 12:27
For anyone interested to track the issue: https://github.com/pandas-dev/pandas/issues/38521 — Leonardus Chen, Dec 16 '20 at 13:18
Does this answer your question? [Pandas merge giving error "Buffer has wrong number of dimensions (expected 1, got 2)"](https://stackoverflow.com/questions/27065133/pandas-merge-giving-error-buffer-has-wrong-number-of-dimensions-expected-1-go) — Lydia van Dyke, Jan 02 '21 at 17:15
When you use `df.a = range(5)` it does not give an error. Similarly, `df.b = range(5)` does not give an error. It gives an error only when you use `.loc` or `.iloc`. This must be because we have a duplicate index for b. When I inspect `df.axes` it gives me `[Index([], dtype='object'), Index(['a', 'b', 'b'], dtype='object')]`. — Joe Ferndz, Jan 10 '21 at 06:19

score 1 · Answer 1 · answered Jan 21 '21 at 14:34

Why use loc and not just:

df['a'] = list(range(5))

This gives no error and seems to produce what you need:

   a    b    b
0  0  NaN  NaN
1  1  NaN  NaN
2  2  NaN  NaN
3  3  NaN  NaN
4  4  NaN  NaN

The same works for creating column c:

df['c'] = list(range(5))
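
Putting both assignments together, as a sketch starting from the frame in the question. I have not checked every pandas version, so treat the behaviour as an assumption to verify on yours:

```python
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'b'])

# Plain column assignment instead of .loc
df['a'] = list(range(5))   # existing column with a unique label
df['c'] = list(range(5))   # entirely new column

print(df)
```

The duplicate 'b' columns are left untouched here; only 'a' and 'c' receive values.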

Janneman
