I have the following pandas dataframe (pandas 0.20.2, python 3.6.2):

import re
import pandas as pd

#    df=pd.DataFrame([['abc00010                    Pathway'],['abc00020                    Pathway']], columns=["ENTRY"])
df=pd.DataFrame(columns=["ENTRY"])
df.loc[:,"ENTRY"]=[list(['abc00010                    Pathway']),list(['abc00020                    Pathway'])]


df["ENTRY2"]=df.loc[:,"ENTRY"]  
df["ENTRY3"]=df.loc[:,"ENTRY"]  
df["ENTRY4"]=df.loc[:,"ENTRY"]  
df["ENTRY5"]=df.loc[:,"ENTRY"]  
df["ENTRY6"]=df.loc[:,"ENTRY"]  


dfcleaner=re.compile(r"\W+?Pathway")  
df.loc[:,"ENTRY"]=df.loc[:,"ENTRY"].apply(str)
df.loc[:,"ENTRY"].replace(dfcleaner,"", inplace=True, regex=True)  

df.loc[:,"ENTRY2"]=df.loc[:,"ENTRY2"].apply(str)
df.loc[:,"ENTRY2"].replace(dfcleaner,"")

df.loc[:,"ENTRY3"].replace(dfcleaner,"", inplace=True, regex=True)
df["ENTRY4"]=df.loc[:,"ENTRY4"].str.replace(dfcleaner,"")#>NANA

df.loc[:,"ENTRY5"]=df.loc[:,"ENTRY5"].replace(dfcleaner,"", inplace=True, regex=True)
df.loc[:,"ENTRY6"]=df.loc[:,"ENTRY6"].replace(dfcleaner,"", regex=True)

   ENTRY         ENTRY2                                   ENTRY3                                   ENTRY4  ENTRY5  ENTRY6
0  ['abc00010']  ['abc00010                    Pathway']  ['abc00010                    Pathway']  nan     None    ['abc00010                    Pathway']
1  ['abc00020']  ['abc00020                    Pathway']  ['abc00020                    Pathway']  nan     None    ['abc00020                    Pathway']

I expected ENTRY2 to stay unchanged, and likewise ENTRY3 and ENTRY6, since they are neither strings nor converted to strings. I also expected ENTRY5 to be None, since replacing in place returns None.

What I did not expect was the ENTRY4 behavior with the string accessor. Could you explain it to me? I can't decide whether it is a bug; if it is one, it has not been reported yet.

EDIT: I changed the code above because the first version did not produce a df matching the one in my real code, and so did not match the results shown here.

Ando Jurai
  • @terrya That's obviously not a duplicate: as stated in the pandas docs, replace with a regex uses re.sub under the hood, and the question there doesn't use a regex, while I do. The flag is inappropriate, all the more so because no solution in the supposedly duplicate question applies here. This question is specific to pandas, not to plain strings. – Ando Jurai Aug 07 '17 at 08:22

1 Answer

I expected ENTRY2 not to be changed, as well as ENTRY3 and ENTRY6 since they are not strings nor converted to it

All your columns have object dtype, which is the dtype pandas uses for strings (and for any other Python objects):

In [11]: df.dtypes
Out[11]:
ENTRY     object
ENTRY2    object
ENTRY3    object
ENTRY4    object
ENTRY5    object
ENTRY6    object
dtype: object
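
Note that object dtype only says the column holds arbitrary Python objects; it does not by itself guarantee that the cells are strings. A minimal sketch of checking what the cells actually contain (the Series names and the shortened sample values are made up for illustration):

```python
import pandas as pd

# Two object-dtype Series: one holds plain strings, the other one-element lists.
strings = pd.Series(["abc00010    Pathway"], dtype=object)
lists = pd.Series([["abc00010    Pathway"]], dtype=object)

# Both report the same dtype...
print(strings.dtype, lists.dtype)

# ...so inspect the Python type of each cell to tell them apart.
string_types = strings.map(type).tolist()
list_types = lists.map(type).tolist()
print(string_types, list_types)
```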

ENTRY5 as replacing in place will return none

That's how inplace=True works. You either assign back the returned Series when using inplace=False (the default):

df.loc[:,"ENTRY5"]=df.loc[:,"ENTRY5"].replace(dfcleaner,"", regex=True)

or update in place - in this case None is returned, so we should not assign it back:

df.loc[:,"ENTRY5"].replace(dfcleaner,"", inplace=True, regex=True)
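
The two styles can be compared side by side on a standalone Series. This is only a sketch: the names `s`, `cleaned`, `target` and `returned` and the shortened sample values are illustrative, and the return value of the in-place call is described for the pandas versions discussed here (0.20.x), since other versions may behave differently.

```python
import pandas as pd

s = pd.Series(["abc00010    Pathway", "abc00020    Pathway"])
pattern = r"\W+?Pathway"

# Default (inplace=False): a new Series is returned and `s` is left untouched.
cleaned = s.replace(pattern, "", regex=True)

# inplace=True: `target` itself is modified. In pandas 0.20.x the call
# returns None, so its result should not be assigned back to a column.
target = s.copy()
returned = target.replace(pattern, "", regex=True, inplace=True)

print(cleaned.tolist())
print(target.tolist())
```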

What I did not expect was the ENTRY4 behavior with the string accessor.

I could not reproduce ENTRY4 "problem" using your code (Pandas 0.20.1):

In [16]: df
Out[16]:
      ENTRY                               ENTRY2    ENTRY3    ENTRY4 ENTRY5    ENTRY6
0  abc00010  abc00010                    Pathway  abc00010  abc00010   None  abc00010
1  abc00020  abc00020                    Pathway  abc00020  abc00020   None  abc00020
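
If the cells hold one-element lists rather than strings, as in the edited question, NaN in ENTRY4 would be consistent with how the .str accessor treats non-string elements: methods such as .str.replace yield NaN for them instead of raising. A hedged sketch, assuming a reasonably recent pandas (the `regex` keyword of .str.replace does not exist in 0.20.x) and using made-up names and shortened values:

```python
import pandas as pd

lists = pd.Series([["abc00010    Pathway"], ["abc00020    Pathway"]])
pattern = r"\W+?Pathway"

# Each cell is a list, not a str, so the string method cannot be applied
# and every row comes back as NaN.
via_str = lists.str.replace(pattern, "", regex=True)

# Taking the first item of each list with .str[0] gives real strings,
# on which .str.replace then works as expected.
unwrapped = lists.str[0].str.replace(pattern, "", regex=True)

print(via_str.isna().all())
print(unwrapped.tolist())
```

Unwrapping the lists first, rather than converting them with apply(str), also avoids keeping the brackets and quotes of the list repr in the result.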
MaxU - stand with Ukraine
  • Thanks for pointing out the mistake in my sample code, and for the explanation. I edited it to match my original code, because my df does indeed contain lists from the start. – Ando Jurai Aug 07 '17 at 10:24