When I assign a new column to one DataFrame in a list, the column also shows up in every other DataFrame in the list. Example:

In [219]: a = [pd.DataFrame()]*2
In [220]: a[0]['a'] = [1,2,3]
In [221]: a[1]
Out[221]: 
   a
0  1
1  2
2  3

Is this a bug? And what can I do to prevent it?

Thanks!

Yehuda Karlinsky
  • 1
    That's known behavior when you try to initialize a list with `[]*n`. Use `a = [pd.DataFrame() for i in range(2)]` to initialize the list instead. – Psidom Nov 16 '16 at 14:16
  • Thanks Psidom! Out of curiosity, do you know why this happens? – Yehuda Karlinsky Nov 16 '16 at 14:19
  • 1
    Possible duplicate of ["Least Astonishment" and the Mutable Default Argument](http://stackoverflow.com/questions/1132941/least-astonishment-and-the-mutable-default-argument) – Zeugma Nov 16 '16 at 14:21
  • 2
    By doing that you are creating a list of references to the same object so whenever you modify one of them, others change at the same time. See http://stackoverflow.com/questions/240178/python-list-of-lists-changes-reflected-across-sublists-unexpectedly. – Psidom Nov 16 '16 at 14:23

2 Answers

This happens because when you build a list with this syntax

x = [something]*n

you end up with a list in which every item is THE SAME something. No copies are made; every slot references the SAME object:

>>> import pandas as pd
>>> a=pd.DataFrame()
>>> g=[a]*2
>>> g
[Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
>>> id(g[0])
129264216L
>>> id(g[1])
129264216L
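
As a rough sketch of the same point, the identity can also be checked with `is` rather than by comparing `id()` values by eye, and the effect is not specific to pandas. The names `shared` and `rows` below are made up for illustration:

```python
import pandas as pd

# Multiplying a list repeats references to the element; it does not copy it.
shared = [pd.DataFrame()] * 2
same_object = shared[0] is shared[1]

# The same thing happens with any mutable element, e.g. a nested list.
rows = [[]] * 2
rows[0].append(1)

print(same_object)  # True
print(rows)         # [[1], [1]]
```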

The comments link to some useful examples that are worth reading through until the behavior makes sense.

To avoid it in your situation, just use another way of instantiating the list:

>>> list(map(lambda x: pd.DataFrame(),range(2)))
[Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
>>> [pd.DataFrame() for i in range(2)]
[Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
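
To convince yourself that this fixes the original problem, a minimal check along these lines should do (the variable names `frames` and `distinct` are only for illustration):

```python
import pandas as pd

# The comprehension calls pd.DataFrame() once per iteration,
# so each slot holds its own object.
frames = [pd.DataFrame() for _ in range(2)]
frames[0]['a'] = [1, 2, 3]

distinct = frames[0] is not frames[1]
print(distinct)                 # True
print(list(frames[0].columns))  # ['a']
print(frames[1].empty)          # True
```

Only the first DataFrame gains the column; the second stays empty.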
greg_data

EDIT: I now see that the comments above explain this.

I don't yet understand what causes this, but you can work around it by creating your DataFrames separately before putting them in a list.

In [2]: df1 = pd.DataFrame()
In [3]: df2 = pd.DataFrame()
In [4]: a = [df1, df2]
In [5]: a[0]['a'] = [1,2,3]
In [6]: a[0]
Out[6]:
   a
0  1
1  2
2  3

In [7]: a[1]
Out[7]:
Empty DataFrame
Columns: []
Index: []
guzman