In Python with pandas, my function that changes a DataFrame's index also changes the index of the original DataFrame

Question

I have the following analysis.py file. The function group_analysis shifts the datetime index of df_input back by the number of days given in its Count column.

# analysis.py
import pandas as pd

def group_analysis(df_input):
    df_input.index = df_input.index - pd.to_timedelta(df_input.Count, unit = 'days')
    df_output = df_input.sort_index()

    return df_output

def test(df):
    df = df + 1
    return df

And I have the following DataFrame.

import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(1, 14), index=pd.date_range('2020-01-01', periods=13, freq='D'), columns=['Count'])

            Count
2020-01-01      1
2020-01-02      2
2020-01-03      3
2020-01-04      4
2020-01-05      5
2020-01-06      6
2020-01-07      7
2020-01-08      8
2020-01-09      9
2020-01-10     10
2020-01-11     11
2020-01-12     12
2020-01-13     13

When I run the following code,

import analysis
y = analysis.group_analysis(x)

the datetime indexes of both x and y are changed (so x.equals(y) is True). Why does group_analysis change the datetime index of both the input and the output? And how can I make it change only the datetime index of y, not x?

However, when I run the following code, x does not change (so x.equals(y) is False).

import analysis
y = analysis.test(x)

EDIT: analysis.test(df) is added.

Try `y = analysis.group_analysis(x.copy())`? This happens because you are passing a reference to your original DataFrame into the function. @david78 – Vishnudev Krishnadas, Jul 13 '20 at 07:39
Thanks for the help:). I do not have this issue when another function changes only the values of x, but not the datetime index of x. For example, def test(): df =df+1 return df. Is there a reason the issue happens only when a function changes the index of the DataFrame? – david78, Jul 13 '20 at 08:40
The first line of your function assigns to the index, which is an attribute of the input DataFrame, so no copy of the DataFrame is created. An addition, by contrast, returns a new DataFrame. To see this, try a DataFrame method with the inplace argument set to True; you'll notice the change. @david78 – Vishnudev Krishnadas, Jul 13 '20 at 11:08
Thank you for the help. Please see the edited original post, where I added a new function 'test(df)'. It is not clear to me why the issue does not occur for 'test(df)'. – david78, Jul 13 '20 at 11:40

score 1 · Answer 1 · answered Jul 13 '20 at 07:47

This happens because calling group_analysis does not pass a copy of the DataFrame to the function; it passes a reference to the same object in memory. If the function modifies that object, the caller's DataFrame (which is the same object) is modified too.

For a very good explanation refer to https://robertheaton.com/2014/02/09/pythons-pass-by-object-reference-as-explained-by-philip-k-dick/.

To prevent this, create a copy of the data when you enter the function:

...
def group_analysis(df):
    df_input = df.copy()
    ...
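
A minimal, self-contained sketch of that fix, assuming the sample frame from the question. The numpy import, the .to_numpy() call and the original_index variable are additions for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

def group_analysis(df):
    # Work on a copy so the caller's DataFrame keeps its original index.
    df_input = df.copy()
    # Shift each timestamp back by the number of days in the Count column.
    offsets = pd.to_timedelta(df_input["Count"].to_numpy(), unit="D")
    df_input.index = df_input.index - offsets
    return df_input.sort_index()

x = pd.DataFrame(
    np.arange(1, 14),
    index=pd.date_range("2020-01-01", periods=13, freq="D"),
    columns=["Count"],
)
original_index = x.index.copy()  # kept only to check x afterwards
y = group_analysis(x)

print(x.index.equals(original_index))  # True: x keeps its original index
print(x.equals(y))                     # False: only y has the shifted index
```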

answered Jul 13 '20 at 07:47

divingTobi


Thanks for the help:). I do not have this issue when another function changes only values of x, but not the datetime index of x. For example, def test(): df =df+1 return df. Is there a reason the issue happens only when a function changes the index of the dataframe? – david78 Jul 13 '20 at 08:40
I believe it depends on the type of the variables. When using simple types, the calling variable is not modified. I am not sure what your example is, though: should `df` be a parameter of test? – divingTobi Jul 13 '20 at 08:51
Thank you for the help. Please find the edited original posting, where I added a new function 'test(df)'. I am not clear why the issue is not found for 'test(df)' – david78 Jul 13 '20 at 11:41

score 0 · Accepted Answer · answered Jul 13 '20 at 12:16

When you pass a DataFrame to a function, Python passes a reference to it. So any in-place change you make to the DataFrame inside the function is reflected in the DataFrame that was passed in.

But in the case of your test function, the addition returns a new DataFrame in memory. How do I know that? Just print the id of the variable before and after the operation.

>>> def test(df):
...     print(id(df))
...     df = df + 1
...     print(id(df))
...     return df
... 
>>> test(df)
139994174011920
139993943207568

Notice the change? The name df now refers to a different object, so the original DataFrame is not affected.
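
The same distinction can be checked with `is` instead of comparing ids by eye. This is only a sketch: the function names rebinds and mutates and the small three-row frame are made up for illustration.

```python
import pandas as pd

def rebinds(df):
    # df + 1 builds a new DataFrame; the local name df now points to it.
    df = df + 1
    return df

def mutates(df):
    # Assigning to an attribute changes the object the caller also holds.
    df.index = df.index + 10
    return df

x = pd.DataFrame({"Count": [1, 2, 3]})

y = rebinds(x)
print(y is x)                # False: y is a different object
print(x["Count"].tolist())   # [1, 2, 3]: x is unchanged

z = mutates(x)
print(z is x)                # True: z and x are the same object
print(x.index.tolist())      # [10, 11, 12]: x's index was changed
```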

answered Jul 13 '20 at 12:16

Vishnudev Krishnadas


Thank you for the clarification! But I do not know how to figure out which operations return a copy of the DataFrame and which do not. Is there a specific rule? The first line of 'group_analysis(df_input)' subtracts 'pd.to_timedelta(df_input.Count, unit = 'days')', but it does not return a copy. – david78 Jul 13 '20 at 12:21
Why do you want to know that? Python always passes objects into functions as references. Unless you want the original DataFrame changed, pass a copy. Each pandas function behaves differently. As I said, you need to think about references. – Vishnudev Krishnadas Jul 13 '20 at 12:45
Thanks for the help. I have many functions with DataFrames as inputs. As I never thought about this issue in the past, my other functions were written without considering it. So I want to know in which situations a function does not return a copy, so that I can review my other functions. Again, many thanks. – david78 Jul 13 '20 at 13:58
We too had many DataFrame-transformation functions, but we made a copy at the start of the process. – Vishnudev Krishnadas Jul 13 '20 at 14:10
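
For reviewing existing functions, one possible approach is to test each of them rather than reason about every pandas call. The sketch below uses a hypothetical helper, leaves_input_unchanged, together with two made-up example functions. It relies on DataFrame.equals, which compares the values and the axis labels, so it would not catch every kind of change.

```python
import pandas as pd

def leaves_input_unchanged(func, df):
    # Hypothetical helper: snapshot the input, call func, then compare.
    snapshot = df.copy(deep=True)
    func(df)
    return df.equals(snapshot)

def shift_index_in_place(df):
    df.index = df.index + 1  # mutates the caller's object
    return df

def add_one(df):
    return df + 1  # builds a new DataFrame

frame = pd.DataFrame({"Count": [1, 2, 3]})
print(leaves_input_unchanged(add_one, frame))               # True
print(leaves_input_unchanged(shift_index_in_place, frame))  # False
```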
