I have a large dataframe with 3 million rows and 23 columns. I use np.select to add a new column whose value depends on which partial match is found in the "Maatregel_naam" column.

My code:

conditions = [
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[vV]erv.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[hH]erst.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[cC]ons.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[oO]nderh.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[rR]epar.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[gG]ara.*?")),
]
values = ["vervangen", "herstellen", "conserveren", "conserveren", "herstellen", "garantie"]
DISK_data["onderdeel"] = np.select(conditions, values, default="anders")

Here is a subset of my dataframe:

Maatregel_naam
1 vervangen beton
2 Vervangen staal
3 Staal vervang.
4 Staal vervangen door
5 Vervangen
6 herstellen
7 Herstellen

How can I adjust my regular expression so that it matches all forms of the word "vervangen"? As the sample shows, the word is sometimes abbreviated and does not always appear at the same position in the string.

Even with the regex documentation and similar posts, I can't quite figure it out, since they do not fully solve my problem. Any help is appreciated.

Tessa
1 Answer

Use word boundary anchors and make your regex case-insensitive:

Example:

DISK_data["Maatregel_naam"].str.contains(r"(?i)\bverv\w*(?:\b|\.)")

This matches a word that starts with "verv" (in any letter case) and ends at a word boundary or a period.
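As a rough sketch of how this pattern could be plugged into the np.select code from the question, the snippet below builds one condition per word stem and runs it on the sample rows. The `stems` list and the `names` variable are names introduced here for illustration, and `na=False` is an assumption that missing values should count as "no match".

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
DISK_data = pd.DataFrame({
    "Maatregel_naam": [
        "vervangen beton",
        "Vervangen staal",
        "Staal vervang.",
        "Staal vervangen door",
        "Vervangen",
        "herstellen",
        "Herstellen",
    ]
})

# Stems and labels taken from the question's patterns and values list.
# "stems" is a hypothetical helper list, paired one-to-one with "values".
stems = ["verv", "herst", "cons", "onderh", "repar", "gara"]
values = ["vervangen", "herstellen", "conserveren", "conserveren", "herstellen", "garantie"]

names = DISK_data["Maatregel_naam"]

# One boolean Series per stem: case-insensitive, anchored at a word boundary.
# na=False (an assumption) treats missing values as non-matching.
conditions = [
    names.str.contains(rf"(?i)\b{stem}\w*(?:\b|\.)", regex=True, na=False)
    for stem in stems
]

# np.select picks the label of the first condition that is True per row.
DISK_data["onderdeel"] = np.select(conditions, values, default="anders")
print(DISK_data)
```

On these sample rows, the first five should be labelled "vervangen" and the last two "herstellen". Because np.select uses the first matching condition, the order of the stems decides the label when a string contains more than one of them.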

Tim Pietzcker