The following script prints the same input variable input_df twice at the end - before and after df_lower has been called:

import pandas as pd

def df_lower(df):
    cols = ['col_1']
    df[cols] = df[cols].applymap(lambda x: x.lower())
    return df

input_df = pd.DataFrame({
    'col_1': ['ABC'],
    'col_2': ['XYZ']
})

print(input_df)
processed_df = df_lower(input_df)
print(input_df)

The output shows that input_df changes:

  col_1 col_2
0   ABC   XYZ
  col_1 col_2
0   abc   XYZ

Why is input_df modified?

Why isn't it modified when full input_df (no column indexing) is processed?

def df_lower_no_indexing(df):
    df = df.applymap(lambda x: x.lower())
    return df
josoler
Marek Grzenkowicz
  • Because by passing input_df into the function and using df[cols] = blabla, you are making the new variable and the old one point to the same place in memory – FabioSpaghetti Jul 29 '19 at 15:01

1 Answer

In the first case, you are assigning to a slice of the input dataframe. In the no-indexing case, you are just assigning a new value to the local variable df:

df = df.applymap(lambda x: x.lower())

This rebinds the local name df to a new DataFrame, leaving the input as is.

Conversely, in the first case, you are assigning a value to a slice of the input, hence, modifying the input itself:

df[cols] = df[cols].applymap(lambda x: x.lower())
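
The difference between the two cases can be checked with an identity test. The following is only a minimal sketch: the function names mutate_in_place and rebind_local are made up for illustration, and Series.str.lower via apply is used instead of applymap, which is an assumption about what the lambda is meant to do.

```python
import pandas as pd

def mutate_in_place(df):
    cols = ['col_1']
    # Item assignment writes into the very object the caller passed in.
    df[cols] = df[cols].apply(lambda s: s.str.lower())
    return df

def rebind_local(df):
    # Plain assignment only rebinds the local name df to a new DataFrame.
    df = df.apply(lambda s: s.str.lower())
    return df

a = pd.DataFrame({'col_1': ['ABC'], 'col_2': ['XYZ']})
b = pd.DataFrame({'col_1': ['ABC'], 'col_2': ['XYZ']})

out_a = mutate_in_place(a)
out_b = rebind_local(b)

print(out_a is a)            # True: same object, so the input changed
print(out_b is b)            # False: a new object was returned
print(a['col_1'].tolist())   # ['abc']
print(b['col_1'].tolist())   # ['ABC']
```

In the first function the returned object is the input itself, which is why printing input_df afterwards shows the lowercased values.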

With a simple change, you can leave the input untouched in the first case as well, by working on a copy:

def df_lower(df):
    cols = ['col_1']
    # Copy first so the assignment below does not touch the caller's DataFrame
    df = df.copy()
    df[cols] = df[cols].applymap(lambda x: x.lower())
    return df
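
Another way to leave the input untouched, sketched here as an alternative rather than as part of the answer above, is to avoid item assignment altogether and build the result with DataFrame.assign, which returns a new DataFrame. The name df_lower_assign and its cols parameter are hypothetical, and Series.str.lower stands in for the applymap lambda.

```python
import pandas as pd

def df_lower_assign(df, cols=('col_1',)):
    # assign() returns a new DataFrame; the caller's object is not modified.
    return df.assign(**{col: df[col].str.lower() for col in cols})

input_df = pd.DataFrame({'col_1': ['ABC'], 'col_2': ['XYZ']})
processed_df = df_lower_assign(input_df)

print(input_df['col_1'].tolist())      # ['ABC']
print(processed_df['col_1'].tolist())  # ['abc']
```

This keeps the function free of side effects, at the cost of requiring string column names that are valid keyword arguments.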
josoler