Can anyone help explain why some operations raise errors and others do not when a pandas.DataFrame has a duplicate column?

Minimal, Reproducible Example

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'b'])

If I try to insert a list into column 'a', I get an error about a dimension mismatch:

df.loc[:, 'a'] = list(range(5))

Traceback (most recent call last):
...
ValueError: cannot copy sequence with size 5 to array axis with dimension 0

The same happens with 'b', although the message differs:

df.loc[:, 'b'] = list(range(5))

Traceback (most recent call last):
...
ValueError: could not broadcast input array from shape (5) into shape (0,2)

However, if I insert into an entirely new column, I don't get an error. Inserting into 'a' or 'b' afterwards still fails, this time with a different message:

df.loc[:, 'c'] = list(range(5))
print(df)

     a    b    b  c
0  NaN  NaN  NaN  0
1  NaN  NaN  NaN  1
2  NaN  NaN  NaN  2
3  NaN  NaN  NaN  3
4  NaN  NaN  NaN  4

df.loc[:, 'a'] = list(range(5))

Traceback (most recent call last):
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

All of these errors disappear if I remove the duplicate column 'b'.


Additional information

pandas==1.0.2

AmyChodorowski
  • It is the duplicate column name - see it asked here: https://stackoverflow.com/questions/27065133/pandas-merge-giving-error-buffer-has-wrong-number-of-dimensions-expected-1-go – Anna Semjén Dec 11 '20 at 16:43
  • Yes, I can see it is caused by the duplicate, however, I am curious as to why I can carry out `df.loc[:, 'c'] = list(range(5))` . I'll rephrase my question. – AmyChodorowski Dec 11 '20 at 16:48
  • My *guess* is that creating a new column first creates a Series, then joins it to the dataframe. Assigning to an existing column attempts to put the values in the pre-allocated positions. – Paul H Dec 11 '20 at 16:53
  • I think it is hard to tell without debugging the package's code itself. If you follow your example into `managers.py`, `blknos` can't be configured correctly because it is an int instead of the array it would be if your column names were unique. The `get_blkno_placements` function expects an array, but an int is passed, so the function can't run. That said, I ran this in version 1.1.2 and couldn't reproduce the issue. – Anna Semjén Dec 11 '20 at 17:16
  • Why do you need to duplicate the column name? – Nour-Allah Hussein Dec 11 '20 at 17:23
  • I really asked out of curiosity, to better understand what is going on under the hood so I can use `pandas.DataFrame` better. Obviously, it is never advisable to have duplicate columns. – AmyChodorowski Dec 11 '20 at 17:32
  • What pandas version is this? – Leonardus Chen Dec 14 '20 at 20:25
  • `pandas==1.0.2` – AmyChodorowski Dec 15 '20 at 08:44
  • @AmyChodorowski - How do `df['a'] = list(range(5))` and `df['b'] = list(range(1,6))` behave? – jezrael Dec 15 '20 at 11:14
  • @jezrael Interestingly, both `df['a']` and `df.a` work fine. – Leonardus Chen Dec 16 '20 at 11:45
  • This issue still exists in 1.3.0, although the error message is slightly different: `ValueError: cannot copy sequence with size 5 to array axis with dimension 0`. I guess I'll open an issue for this. – Leonardus Chen Dec 16 '20 at 12:27
  • For anyone interested in tracking the issue: https://github.com/pandas-dev/pandas/issues/38521 – Leonardus Chen Dec 16 '20 at 13:18
  • Does this answer your question? [Pandas merge giving error "Buffer has wrong number of dimensions (expected 1, got 2)"](https://stackoverflow.com/questions/27065133/pandas-merge-giving-error-buffer-has-wrong-number-of-dimensions-expected-1-go) – Lydia van Dyke Jan 02 '21 at 17:15
  • `df.a = range(5)` does not raise an error, and neither does `df.b = range(5)`. The error only occurs when you use `.loc` or `.iloc`. This must be because the column index has a duplicate entry for `b`: `df.axes` gives me `[Index([], dtype='object'), Index(['a', 'b', 'b'], dtype='object')]`. – Joe Ferndz Jan 10 '21 at 06:19
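
To make the observations in these comments concrete, here is a small sketch of how the duplicate label shows up when inspecting the frame. It only uses `Index.is_unique`, `Index.duplicated()`, and plain bracket selection. The variable names are illustrative, and it does not attempt to reproduce the `.loc` errors, which vary between pandas versions.

```python
import pandas as pd

# Same empty frame as in the question: 'b' appears twice.
df = pd.DataFrame(columns=['a', 'b', 'b'])

# The column index reports that its labels are not unique.
is_unique = df.columns.is_unique

# Boolean mask that marks every repeat of an earlier label.
dup_mask = df.columns.duplicated()

# A unique label selects a Series; a duplicated label selects a
# DataFrame holding every column that carries that label.
selected_a = df['a']
selected_b = df['b']

print(is_unique, dup_mask.tolist())
print(type(selected_a).__name__, type(selected_b).__name__, selected_b.shape)
```

Because `'b'` resolves to two columns rather than one, an assignment to it has to fill a two-dimensional target, which is consistent with the `(0,2)` shape in the second traceback of the question.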

1 Answer

Why use `.loc` rather than plain bracket assignment:

df['a'] = list(range(5))

This gives no error and seems to produce what you need:

a   b   b
0   NaN NaN 
1   NaN NaN 
2   NaN NaN 
3   NaN NaN 
4   NaN NaN 

The same works for creating column 'c':

df['c'] = list(range(5))
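
If the duplicate column is not actually needed, one possible approach (a sketch, not something confirmed in this thread) is to drop the repeated labels first and then assign with plain brackets. The name `deduped` is made up for illustration. The mask from `df.columns.duplicated()` keeps the first occurrence of each label.

```python
import pandas as pd

# Same empty frame as in the question: 'b' appears twice.
df = pd.DataFrame(columns=['a', 'b', 'b'])

# Keep only the first occurrence of each column label.
# .copy() gives an independent frame to assign into.
deduped = df.loc[:, ~df.columns.duplicated()].copy()

# Plain bracket assignment: filling an existing column of an
# empty frame also creates the row index from the values.
deduped['a'] = list(range(5))

# Creating a brand-new column works the same way.
deduped['c'] = list(range(5))

print(deduped)
```

This sidesteps the duplicate label entirely, so it does not explain the `.loc` behaviour; it only avoids it.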
Janneman