6

I'm trying to keep a copy of a pandas DataFrame so that I can modify it while preserving the original. But when I modify the copy, the original DataFrame changes too. For example:

import pandas as pd
df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})
df1

    col1    col2
    a       1
    b       2
    c       3
    d       4

df2 = df1
df2['col2'] = df2['col2'] + 1
df1

    col1    col2
    a       2
    b       3
    c       4
    d       5

I set df2 equal to df1, and when I then modified df2, df1 changed as well. Why does this happen, and is there a way to keep a "backup" of a pandas DataFrame that is not modified along with it?

Henry Ecker
Dan Lo
    It is because you are just making `df2` a synonym for `df1`. They refer to the same object. To change that, I believe you could do `df2 = df1.copy()`. – zondo Feb 27 '16 at 03:00
  • 2
    This is a Python question and has nothing to do with Pandas. When you do your assignment, you get a pointer to the same object. You can confirm this by typing in your IDE `id(df2)` and `id(df1)`, noting that the values are the same (`id` returns the memory location of the object referenced by the variable). You can do the same with lists. `list_1 = [1, 2]` `list_2 = list_1` `list_2[0] = 10` `>>> list_1` returns [10, 2] – Alexander Feb 27 '16 at 04:00
  • http://nedbatchelder.com/text/names.html might help you with some relevant understanding – Mike Graham Feb 28 '16 at 06:22

3 Answers

17

This is much deeper than dataframes: you are thinking about Python variables the wrong way. Python variables are pointers, not buckets. That is to say, when you write

>>> y = [1, 2, 3]

you are not putting [1, 2, 3] into a bucket called y; rather, you are creating a pointer named y which points to [1, 2, 3].

When you then write

>>> x = y

you are not putting the contents of y into a bucket called x; you are creating a pointer named x which points to the same thing that y points to. Thus:

>>> x[1] = 100
>>> print(y)
[1, 100, 3]

Because x and y point to the same object, modifying it via one pointer modifies it for the other pointer as well. If you'd like to point to a copy instead, you need to create that copy explicitly. With lists you can do it like this:

>>> y = [1, 2, 3]
>>> x = y[:]
>>> x[1] = 100
>>> print(y)
[1, 2, 3]
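
One caveat worth sketching: slicing with y[:] copies only the outer list. The example below is an illustration, not part of the original answer; the names shallow and deep are made up for it. It shows that nested lists are still shared after a slice copy, and that the standard library's copy.deepcopy avoids this.

```python
import copy

y = [[1, 2], [3, 4]]
shallow = y[:]            # new outer list, but the inner lists are shared with y
deep = copy.deepcopy(y)   # new outer list and new inner lists

shallow[0][0] = 100       # mutates an inner list that y also refers to
print(y)     # [[100, 2], [3, 4]]
print(deep)  # [[1, 2], [3, 4]]
```

For a flat list of numbers or strings, as in the example above, y[:] is enough.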

With DataFrames, you can create a copy with the copy() method:

>>> df2 = df1.copy()
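
As a rough check of the difference, here is a small sketch using the DataFrame from the question. The names alias and backup are chosen for illustration only. It compares plain assignment with copy(), which makes a deep copy by default.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})

alias = df1          # a second name for the same DataFrame object
backup = df1.copy()  # an independent copy (deep=True is the default)

print(alias is df1)   # True
print(backup is df1)  # False

# Modifying through the alias changes df1, but not the backup
alias['col2'] = alias['col2'] + 1
print(df1['col2'].tolist())     # [2, 3, 4, 5]
print(backup['col2'].tolist())  # [1, 2, 3, 4]
```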
Henry Ecker
jakevdp
3

You need to make a copy:

df2 = df1.copy()

df2['col2'] = df2['col2'] + 1
print(df1)

Output:

  col1  col2
0    a     1
1    b     2
2    c     3
3    d     4

Writing df2 = df1 only creates a second name, df2, for the same object that df1 refers to.

Mike Müller
0

When you set one DataFrame equal to another, both names refer to the same data in the computer's memory. This means that if you change a value through the new name, the change also shows up through the old one. To fix this, make a copy instead of assigning the original directly. Example: df2 = df1.copy()

Adil Ras