Regex match partial word not capturing all variants of the same word

Question

I have a large dataframe consisting of 3 million rows and 23 columns. I use np.select to add a new column: when a partial match is found in a row, the new column gets the value that belongs to the matching condition.

My code:

conditions = [
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[vV]erv.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[hH]erst.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[cC]ons.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[oO]nderh.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[rR]epar.*?")),
    (DISK_data["Maatregel_naam"].str.contains(r"(?:^|\s)[gG]ara.*?")),
    ]
values = ["vervangen", "herstellen", "conserveren", "conserveren", "herstellen", "garantie"]
DISK_data["onderdeel"] = np.select(conditions, values, default="anders")

Here is a subset of my dataframe:

Maatregel_naam
1 vervangen beton
2 Vervangen staal
3 Staal vervang.
4 Staal vervangen door
5 Vervangen
6 herstellen
7 Herstellen

How can I adjust my regular expression so that it matches all forms of the word "vervangen"? As you can see in my dataframe, the word is not always written out in full, and it does not always appear at the same position in the string.

Even with the regex documentation and similar posts I can't quite figure it out, since they do not fully solve my problem. Any help is appreciated.

score 0 · Accepted Answer · answered Jul 11 '23 at 15:36


Use word boundary anchors and make your regex case-insensitive:

Example:

DISK_data["Maatregel_naam"].str.contains(r"(?i)\bverv\w*(?:\b|\.)")

This matches a word that starts with verv (in upper or lower case, thanks to (?i)) and ends at a word boundary or a period.
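As a rough sketch of how this could slot into the np.select setup from the question, the same pattern can be built for each word stem in a loop. The sample rows, stems, and labels below come from the question; the `stems` dictionary and the `na=False` argument (which treats missing values as non-matches) are assumptions added for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question
DISK_data = pd.DataFrame({
    "Maatregel_naam": [
        "vervangen beton",
        "Vervangen staal",
        "Staal vervang.",
        "Staal vervangen door",
        "Vervangen",
        "herstellen",
        "Herstellen",
    ]
})

# Hypothetical helper mapping: word stem -> label (pairs taken from the question)
stems = {
    "verv": "vervangen",
    "herst": "herstellen",
    "cons": "conserveren",
    "onderh": "conserveren",
    "repar": "herstellen",
    "gara": "garantie",
}

col = DISK_data["Maatregel_naam"]

# One boolean mask per stem: case-insensitive, stem must start a word.
# na=False is an assumption: rows with missing text count as "no match".
conditions = [
    col.str.contains(rf"(?i)\b{stem}\w*(?:\b|\.)", regex=True, na=False)
    for stem in stems
]
values = list(stems.values())

# np.select uses the first matching condition for each row
DISK_data["onderdeel"] = np.select(conditions, values, default="anders")
print(DISK_data)
```

On the sample rows, the first five should end up labelled "vervangen" and the last two "herstellen". Because np.select picks the first condition that matches, the order of the stems matters if a row contains more than one of them.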


Tim Pietzcker
