
I'm sorry if this question has already been answered, but I honestly don't know what either of these is called (object, list, or array?), so I'm still confused.

I'm asking out of curiosity, as a follow-up to this question:

Pandas: Getting "TypeError: only integer scalar arrays can be converted to a scalar index" while trying to merge data frames

And the answer by Ilyas.

Why does [[list]] result in the error

only integer scalar arrays can be converted to a scalar index

but [list] doesn't?

heilala
Lulu Firdaus
  • `['a']` is a list of strings (presumably) and `[['a']]` is a list of a list of strings. Why would one be a valid substitute for the other? If I index into that list expecting a string and instead get a list then a TypeError is the most likely and easily debugged outcome. – Jared Smith Jul 21 '20 at 13:18
  • I see. Thank you! I've only realised this now. – Lulu Firdaus Jul 21 '20 at 13:34
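
To make the comment above concrete, here is a minimal sketch in plain Python (no pandas involved; the variable names `flat` and `nested` are made up for illustration). It only shows that indexing into each list yields a different type:

```python
flat = ['a']      # a list of strings
nested = [['a']]  # a list whose only element is itself a list

first_of_flat = flat[0]      # the string 'a'
first_of_nested = nested[0]  # the list ['a'], not a string

print(type(first_of_flat).__name__)    # str
print(type(first_of_nested).__name__)  # list
```

Code that expects a label and receives a list instead will usually fail somewhere further down the line, which is what happens in the pandas case.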

1 Answer


The relevant code from the linked question is:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 1]})

df1.columns = [['b']] # WRONG: a list of lists
df1.columns = ['b']   # CORRECT: a flat list of labels

df1.merge(df2, on='b')

df.columns must be a list of column labels [1], each representing the name of a column. In the wrong version, you're setting it to a one-element list whose only element is itself a list (NOT a string). The assignment itself succeeds, but it is the source of the error raised later by merge.

[1] "Column labels" is straight from the documentation of DataFrame.columns; strings are one valid kind of label (although not the only one, see the comments below). A list of lists, on the other hand, generates a MultiIndex (try print(df1.columns) after the "wrong" version), which causes problems later in the call to merge.
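
As a rough illustration of the footnote, the sketch below assigns both variants and checks which kind of index pandas builds. The frame names `wrong` and `right` are made up for this example, and the exact printed representation may vary between pandas versions:

```python
import pandas as pd

wrong = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'a': [1, 2]})

wrong.columns = [['b']]  # list of lists
right.columns = ['b']    # flat list of labels

# Neither assignment raises; they just produce different index types.
is_multi_wrong = isinstance(wrong.columns, pd.MultiIndex)
is_multi_right = isinstance(right.columns, pd.MultiIndex)

print(type(wrong.columns).__name__, is_multi_wrong)
print(type(right.columns).__name__, is_multi_right)
```

Only the nested version should report a MultiIndex, which is the mismatch that merge later trips over.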

GPhilo
  • *`df.columns` must be a list of strings, each representing the name of a column.* Not true...Try `df1.columns = [('a',)]` – Ch3steR Jul 21 '20 at 13:25
  • Fair point, I wasn't quite sure whether other hashable types would be allowed. The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html) talks about "column labels", but I guess that is broader than strings. I'll adapt the answer – GPhilo Jul 21 '20 at 13:27
  • Thank you for the clear explanation! I get it now. ^_^ – Lulu Firdaus Jul 21 '20 at 13:35
  • `df1.columns = [['b']]` doesn't raise an error; it converts `[['b']]` to a `MultiIndex`. The error is related to `merge`, **not** `df.columns`. – Ch3steR Jul 21 '20 at 13:49
  • @Ch3steR isn't `merge` failing because it uses `['b']` under the hood to try indexing into the dataframe? (Is `columns` a property with a setter that does the conversion to MultiIndex, or are we overwriting a list and facing the consequences later on? I couldn't find any information about this.) Setting `columns` does not itself raise the error, but it's the source of it. – GPhilo Jul 21 '20 at 13:52
  • After digging through the source: in `pandas/merge.py`, the `_MergeOperation` class has a function `get_results`, which calls `_get_merge_keys`, which in turn calls `_get_label_or_level_values`, which calls `xs` on `df1` (in this case); that is what generates the error. Try `df1.columns = [['b']]; df1.xs('b',axis=1)` and you get the same error (same traceback, same function calls). – Ch3steR Jul 21 '20 at 14:14
  • @Ch3steR Indeed. I looked for the setter of `columns`, which I didn't find in `pandas/core/frame.py` and I can't quite find anywhere else. Either way, setting both `df1` and `df2` to have MultiIndexes still causes problems on `merge` (I tried with multiple versions of the `on` attribute, ofc). My take on this is: keep the columns simple. The docs for this could definitely use some additional love. – GPhilo Jul 21 '20 at 14:23
  • `df1.columns =[['b']]; df2.columns = [['b']]; df1.merge(df2, on = [('b',)],how='outer')` gives the correct output. Since OP is converting `df1.columns` to a `MultiIndex` while `df2.columns` is an `Index`, `df.xs` is raising the error, I guess. – Ch3steR Jul 21 '20 at 14:33
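
Following the "keep the columns simple" advice in the comments, one possible way to recover from an accidental nested list is to flatten the columns back to plain labels before merging. This is only a sketch under the assumption that the MultiIndex has a single level, as it does in the question; `merged` is a name made up for the example:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 1]})

df1.columns = [['b']]  # the accidental nested list from the question

# If the columns became a MultiIndex, fall back to its first (and only) level.
if isinstance(df1.columns, pd.MultiIndex):
    df1.columns = df1.columns.get_level_values(0)

# Both frames now have a plain 'b' column, so a simple merge key works.
merged = df1.merge(df2, on='b')
print(merged)
```

With the default inner join, only the value present in both frames (1) should remain in `merged`.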