12

I have a large DataFrame that I need to clean, as a sample please look at this dataframe:

import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'],
        'Price': ['22000.T','25000.T','27000','.TPX','.NKM1']
        }

df = pd.DataFrame(cars, columns = ['Brand', 'Price'])

print (df)

I want to remove '.T' from the end of the words, and only '.' from the beginning of the rows that contain the.

by the following line of code, I could remove the '.T'

df['Price'].replace('.T', '', regex=True)

but it also removed the 'T' from the '.TPX'

any advice on this is appreciated.

0    22000
1    25000
2    27000
3       PX
4    .NKM1
Name: Price, dtype: object

Also for removing the '.' when I add this line

f['Price'].replace('.', '', regex=True)

I get a different dataframe as what I expected

0    
1    
2    
3    
4    
Name: Price, dtype: object
sophocles
  • 13,593
  • 3
  • 14
  • 33
sam_sam
  • 449
  • 1
  • 5
  • 16

7 Answers7

8

Another way would be to use numpy.where and evaluate your conditions using str.startswith and str.endswith:

import numpy as np

p = df['Price'].str
df['Price'] = np.where(p.startswith('.'),p.replace('.','',regex=True),
                         np.where(p.endswith('.T'),p.replace('.T','',regex=True),p))

This will check whether df['Price'] starts with a . or ends with a .T and replace them.

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1
sophocles
  • 13,593
  • 3
  • 14
  • 33
6

Series.str.replace

df['Price'] = df['Price'].str.replace(r'^(?:\.)?(.*?)(?:\.T)?$', r'\g<1>')

Series.str.extract

df['Price'] = df['Price'].str.extract(r'^(?:\.)?(.*?)(?:\.T)?$', expand=False)

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1

Regex details:

  • ^ : Assert position at the start of line
  • (?:\.) : Non capturing group which matches the character .
  • ? : Matches the previous non capturing group zero or one time
  • (.*?) : Capturing group which matches any character except line terminators zero or more times but as few times as possible (lazy match)
  • (?:\.T) : Non capturing group which matches .T
  • ? : Matches the previous non capturing group zero or one time
  • $ : Asserts position at the end of the line

See the Regex demo

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • 1
    @anky Because it would also remove unwanted characters at the start and end..Consider the imaginary word `ABCDDTTT.T` so using `.rstrip(.T)` would result in `ABCDD` that means it would also remove extra `T's` at the end which might be undesirable...Although if that's not a problem in the dataset then we can definitely use `rstrip.` – Shubham Sharma Mar 19 '21 at 14:36
4

You should be able to what you want with anchors and what's called a positive lookbehind.

df['Price'].replace('(?<=.)\.T$', '', regex=True)

With regular expressions, there's special characters that have added functionality. Here, the '$' means ends with. So if you want to just affect stings that end in '.T' you want to add that to the end. The part of the expression that is the lookbehind is '(?<=.)'. The parentheses signify a group.

I don't really know how to explain it other than it's kind of similar to how CSS classes work, which really isn't that great of an example.

The '?<=.' is the actual parameters for the lookbehind, it tells the regex engine to match any character ( the '.' ) before the match that exists outside the group ( '.T' ).

To replace the words starting with '.' is very simple. It's just the opposite anchor,

df['Price'].replace('^\.', '', regex=True)

https://regex101.com/ is a great website to help build your regexes. It will also explain what your regex does.

zelarian
  • 121
  • 1
  • 4
4

You can also use numpy.select:

In [178]: import numpy as np

In [179]: conds = [df.Price.str.endswith('.T'), df.Price.str.startswith('.')]
In [182]: choices = [df.Price.str.replace('.T', '', regex=True), df.Price.str.replace('.', '', regex=True)]

In [189]: df.Price = np.select(conds, choices, default=df.Price)

In [190]: df
Out[190]: 
            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1
Mayank Porwal
  • 33,470
  • 8
  • 37
  • 58
4

You can match either a dot at the start of the string, or match .T at the end. Then use an empty string in the replacement.

\A\.|\.T\Z

For example

import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'],
        'Price': ['22000.T','25000.T','27000','.TPX','.NKM1']
        }

df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
df['Price'] = df['Price'].replace(r"\A\.|\.T\Z", "", regex=True)
print(df)

Output

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1
The fourth bird
  • 154,723
  • 16
  • 55
  • 70
3

I want to explain why you got that result. This is because . has special meaning when used in pattern, re docs list of special characters starts with

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

So when you mean literal . you need to escape it, consider following example

df = pd.DataFrame({"col1":["3.45"]})
df["unescaped"] = df.col1.replace(r'.','X',regex=True)
df["escaped"] = df.col1.replace(r'\.','X',regex=True)
print(df)

output

   col1 unescaped escaped
0  3.45      XXXX    3X45

Note that I used so called raw-string here, which allows more readable form of escaping characters with special meaning in pattern (without raw-string I would have to write '\\.', consult re docs for more information). If you struggle with regular expression pattern I suggest using regex101.com to get its explanation.

Daweo
  • 31,313
  • 3
  • 12
  • 25
3

Use

>>> df['Price'].str.replace(r'^\.|\.T$', '', regex=True)
0    22000
1    25000
2    27000
3      TPX
4     NKM1

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  \.                       '.'
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  \.                       '.'
--------------------------------------------------------------------------------
  T                        'T'
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
Ryszard Czech
  • 18,032
  • 4
  • 24
  • 37