Finding if string has element from list and returning that element from list

Question

Currently I'm struggling to find an elegant way to program a problem I'm facing.

I have a large dataframe, containing a column with Department names:

INPUT

demo = pd.DataFrame(
    {'Department':
        ['AA','AA1','BB team 1','AA but also a bit of nonsense',
        'BB','AA','department BB','Complete nonsense']}
    )

Department
AA
AA1
BB team 1
AA but also a bit of nonsense
BB
AA
department BB 
Complete nonsense

I also have a list of known departments:

known_departments = ['AA','BB']

As one can see, there are three type of departments in the list:

Departments that are exact matches with known departments, these should stay untouched.
Departments that are variants of the known departments. That is: it contains the department name, but some other text is present. These should be mapped to the original known departments.
Complete nonsense departments, with no match whatsoever with the known departments, these should stay untouched as well.

Desired output

Department                      Department_simplified
AA                              AA
AA1                             AA
BB team 1                       BB
AA but also a bit of nonsense   AA
BB                              BB
AA                              AA
department BB                   BB
Complete nonsense               Complete nonsense

UPDATE

Thanks Chris and sophocles for your answers. While looking more elegant by using str.extract and str.findall, performance-wise the apply+function outperforms both on my actual df:

Solution    %%timeit -n20
Chris       1.65s ± 311 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
sophocles   1.14s ± 294 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
Paul        680 ms ± 174 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)

Does this get you started? there are some vectorised solutions, too https://stackoverflow.com/questions/48590488/map-str-contains-across-pandas-dataframe — CallumDA, Dec 29 '21 at 15:50
@CallumDA. this works, but is the same solution as my answer below. I guess I'm more interested in the vectorised solutions. — Paul, Dec 29 '21 at 15:56
what are statistics? how often department is same as department_simplified? how many departments are there? how long is average department name, how long is average department_simplified? — dankal444, Dec 29 '21 at 21:34

Paul · Answer 1 · 2021-12-29T16:27:54.517

I currently use apply in combination with a function to get my result.

Code:

def item_in_string(string, list_of_items):
    for item in list_of_items:
        if item in string:
            return item
    return string

demo['Department_simplified'] = demo.Department.apply(
    lambda x: item_in_string(x, known_departments) if isinstance(x, str) else x)

However, this does not feel very efficient, nor pythonic.

I was wondering if someone has a better way to program this problem.

score 0 · Answer 2 · answered Dec 29 '21 at 15:56

You can use str.extract here, and build a list of deparments separated by | (or) as the pattern.

import pandas as pd

known_departments = ['AA','BB']

demo = pd.DataFrame(
    {'Department':
        ['AA','AA1','BB team 1','AA but also a bit of nonsense',
        'BB','AA','department BB','Complete nonsense']}
    )

demo['Department_simplified'] = demo.Department.str.extract(f"({'|'.join(known_departments)})")

# If you need to fill nulls with the original dept name
demo['Department_simplified'].fillna(demo['Department'], inplace=True)

print(demo)

Output

    Department Department_simplified
0                             AA                    AA
1                            AA1                    AA
2                      BB team 1                    BB
3  AA but also a bit of nonsense                    AA
4                             BB                    BB
5                             AA                    AA
6                  department BB                    BB
7              Complete nonsense     Complete nonsense

score 0 · Answer 3 · answered Dec 29 '21 at 15:56

You can first use str.findall to return the matching substrings of your Department column with your list elements (known_departments). For the ones that nothing is returned, you just use the values in Department, as there's no match whatsoever

demo['Department_simplified'] = demo['Department']\
    .str.findall('|'.join(known_departments)).str.join('')

demo['Department_simplified'] = np.where(
    demo['Department_simplified'].eq(''),demo['Department'],demo['Department_simplified'])

Prints:

                      Department Department_simplified
0                             AA                    AA
1                            AA1                    AA
2                      BB team 1                    BB
3  AA but also a bit of nonsense                    AA
4                             BB                    BB
5                             AA                    AA
6                  department BB                    BB
7              Complete nonsense     Complete nonsense

Finding if string has element from list and returning that element from list

INPUT

Desired output

UPDATE

3 Answers3