
I'm learning to use pipelines and made a pretty simple pipeline with a FunctionTransformer to add a new column, an ordinal encoder and a LinearRegression model.

But it turns out I'm getting a SettingWithCopyWarning when I run the pipeline, and I isolated the issue to the FunctionTransformer.

Here is the code. I omitted everything unnecessary (like the ordinal encoder and the regressor in the pipeline):

def weekfunc(df):
    df['date'] = pd.to_datetime(df.loc[:,'date'])
    df['weekend'] = df.loc[:, 'date'].dt.weekday
    df['weekend'].replace(range(5), 0, inplace = True)
    df['weekend'].replace([5,6], 1, inplace = True)
    return df

get_weekend = FunctionTransformer(weekfunc)

pipe = Pipeline([
    ('weekend transform', get_weekend),
])

pipe.transform(X_train)

This gives me the following warning:

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py:6619: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)

This is weird, since I can do the same thing without the FunctionTransformer and not get the warning.

I'm truly confused here, so any help is appreciated.

default-303

1 Answer


This warning tells you that you may not have done what you intended. You are trying to update something that may be a view or a copy of another DataFrame. The object you assigned to has been updated, but you may not have updated the original df. That's the issue.

Pandas warns you because there is potential for big errors especially when you are dealing with big datasets.

Let's demonstrate:

df=pd.DataFrame({'city':['San Fransisco', 'Nairobi'], 'score':[123,95]})

Let's subset and add 2 to score if city is Nairobi:

df['score']=df.loc[df['city']=='Nairobi','score']+2

Outcome

            city  score
0  San Fransisco    NaN
1        Nairobi   97.0

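Notice that although the assignment ran, it nulled San Fransisco's score: the right-hand side contains only the Nairobi row, and pandas aligns on the index when you assign to the whole column, so the missing row becomes NaN. This kind of silent mismatch is what the warning is all about.

What's the right way to do it? The right way is to mask what you do not need to update. One way to do this is what the warning recommends: select the cells to be updated using the .loc accessor.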

df.loc[df['city']=='Nairobi','score']=df.loc[df['city']=='Nairobi','score']+2

Outcome

            city  score
0  San Fransisco    123
1        Nairobi     97
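
For anyone who wants to run the two assignments side by side, here is a minimal self-contained sketch. The variable names `whole`, `masked` and `is_nairobi` are mine and only there for illustration; the data is the same two-row frame as above.

```python
import pandas as pd

df = pd.DataFrame({'city': ['San Fransisco', 'Nairobi'], 'score': [123, 95]})

# Whole-column assignment: the right-hand side holds only the Nairobi row,
# so after index alignment the San Fransisco row has no value and becomes NaN.
whole = df.copy()
whole['score'] = whole.loc[whole['city'] == 'Nairobi', 'score'] + 2

# Masked assignment: the same boolean mask on both sides, so only the
# matching row is written and every other row keeps its value.
masked = df.copy()
is_nairobi = masked['city'] == 'Nairobi'
masked.loc[is_nairobi, 'score'] = masked.loc[is_nairobi, 'score'] + 2

print(whole)
print(masked)
```

Working on `df.copy()` in each case keeps the two experiments from affecting each other.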
wwnde
  • I'm kinda confused: when am I supposed to use df['column name'] and when df.loc[:, 'column name']? – default-303 Jan 03 '22 at 13:37
  • They are the same; both are Series. Try `type(df['city'])` and `type(df.loc[:,'city'])`. `df.loc[:,'city']` just means all rows in the column city, and so does `df['city']`. This changes, however, when we replace `:` with a boolean selection. Try, for instance, `df.loc[df['score']==95,'city']`, which gives you the column `city` with the rows where the `score` is `95`. Does that help? What I would advise is to read the documentation whenever you hit a difficulty, so you learn as you go. Happy coding man!! https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html – wwnde Jan 03 '22 at 14:00
  • In your case, you didn't need loc on the right-hand side only, because that is confusing pandas. Pandas imagines you have created a view. Use either `df['date'] = pd.to_datetime(df['date'])` or `df.loc[:,'date'] = pd.to_datetime(df.loc[:,'date'])`. Let me know how you go. Happy to help – wwnde Jan 03 '22 at 14:07
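
Following up on the comments above, here is a hedged sketch of one way the transformer function could be written so that it never writes into the frame it receives. It assumes, which the question does not state, that `X_train` is itself a slice of a larger DataFrame (for example the output of a train/test split). The name `weekfunc_copy` and the two-row `sample` frame are hypothetical and made up for illustration.

```python
import pandas as pd

def weekfunc_copy(df):
    # Hypothetical variant of weekfunc: work on an explicit copy so the
    # caller's frame (possibly a slice of another frame) is never modified.
    out = df.copy()
    out['date'] = pd.to_datetime(out['date'])
    # dt.weekday is Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    # Assigning a new column avoids replace(..., inplace=True) on a column.
    out['weekend'] = (out['date'].dt.weekday >= 5).astype(int)
    return out

# Made-up data: 2022-01-01 was a Saturday, 2022-01-03 a Monday.
sample = pd.DataFrame({'date': ['2022-01-01', '2022-01-03']})
subset = sample[sample['date'] >= '2022-01-01']  # a slice, standing in for X_train

result = weekfunc_copy(subset)
print(result)
```

A function like this can be passed to `FunctionTransformer` the same way as the original `weekfunc`. Because it only assigns to its own copy, it should not trigger the warning, at the cost of one extra copy of the data per call.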