
For instance, I have a dataframe as below:

import pandas as pd
df = pd.DataFrame({"col":['AM RLC, F C', 'AM/F C', 'DM','D C']})

    |col            |
----|---------------|
0   |"AM RLC, F C"  |
1   |"AM/F C"       |
2   |"DM"           |
3   |"D C"          |

My expected output is as follows:

    |col
----|-----------------------|
 0  |["AM", "RLC", "F", "C"]|
 1  |["AM", "F", "C"]       |
 2  |["DM"]                 |
 3  |["D", "C"]             |

",", "/" and spaces should all be treated as delimiters.

The answers in this question do not answer my queries

Macosso

4 Answers

I would use str.split or str.findall:

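# split on runs of whitespace, commas, or slashes (raw string avoids escape warnings)
df['col'] = df['col'].str.split(r'[\s,/]+')

# or: collect every run of word characters
df['col'] = df['col'].str.findall(r'\w+')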

Output:

               col
0  [AM, RLC, F, C]
1       [AM, F, C]
2             [DM]
3           [D, C]

Regex:

[\s,/]+  # at least one of space/comma/slash with optional repeats

\w+      # one or more word characters
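
The two patterns give the same result on the question's data, but they can differ on other input. The following is a small sketch on a made-up Series (the sample strings and variable names are assumptions, not from the question) showing where they diverge: splitting leaves an empty string when a row starts with a delimiter, while `\w+` breaks tokens at any non-word character such as an apostrophe.

```python
import pandas as pd

# Hypothetical sample: row 0 starts with a delimiter, row 1 contains an apostrophe
s = pd.Series([", AM/F C", "O'Neil RLC"])

# Splitting on delimiters keeps everything between them, including the
# empty string before a leading delimiter
split_result = s.str.split(r"[\s,/]+").tolist()
# [['', 'AM', 'F', 'C'], ["O'Neil", 'RLC']]

# Matching word characters never yields empty strings, but it also
# treats the apostrophe as a separator
findall_result = s.str.findall(r"\w+").tolist()
# [['AM', 'F', 'C'], ['O', 'Neil', 'RLC']]

print(split_result)
print(findall_result)
```

Which behavior is preferable depends on the data: `findall` is more forgiving of stray delimiters, `split` is stricter about what counts as a separator.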
mozway

Try this:

df["col"].apply(lambda x:x.replace(",","").replace("/"," ").split(" "))
Mouad Slimane

A one-liner that finds any punctuation in your string and replaces it with a space. Then you can split the string on whitespace and get a clean list:

import string

df['col'].str.replace(f'[{string.punctuation}]', ' ', regex=True).str.split().to_frame()
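
A slightly more defensive variant, along the lines of the suggestion in the comment below, escapes the punctuation characters with `re.escape` before placing them in a character class, so that characters such as `]`, `\` and `^` cannot be misread as regex syntax. This is only a sketch on the question's sample data; the variable names `pattern` and `tokens` are illustrative.

```python
import re
import string

import pandas as pd

df = pd.DataFrame({"col": ["AM RLC, F C", "AM/F C", "DM", "D C"]})

# Escape every punctuation character so the character class is unambiguous
pattern = f"[{re.escape(string.punctuation)}]+"

# Replace each run of punctuation with a single space, then split on whitespace
tokens = df["col"].str.replace(pattern, " ", regex=True).str.split()

print(tokens.tolist())
# [['AM', 'RLC', 'F', 'C'], ['AM', 'F', 'C'], ['DM'], ['D', 'C']]
```

Note that this treats every punctuation character as a delimiter, which is broader than the comma and slash asked for in the question.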
ali bakhtiari
  • This won't work as you showed it, there are many mistakes, f-string is incorrect, you need to craft and use a regex, and replace with a space. For instance: `df['col'].str.replace(f'[{re.escape(string.punctuation)}]+', ' ', regex=True).str.split()` – mozway Jan 05 '23 at 16:02
  • 1
    You are right, thanks for your vigilance. I edited my original answer and it works now. – ali bakhtiari Jan 05 '23 at 16:07

Apply a function to each row of the col column to extract its tokens. In this case the function is written as a lambda.

import pandas as pd
import re

df = pd.DataFrame({"col":['AM RLC, F C', 'AM/F C', 'DM','D C']})

# re.findall already returns a list; wrapping it in str() would store a string instead
df['col'] = df['col'].apply(lambda x: re.findall(r"[\w']+", x))

print(df.head())

Output:

               col
0  [AM, RLC, F, C]
1       [AM, F, C]
2             [DM]
3           [D, C]
Ali