Currently I'm struggling to find an elegant way to program a problem I'm facing.
I have a large dataframe, containing a column with Department names:
INPUT
demo = pd.DataFrame(
{'Department':
['AA','AA1','BB team 1','AA but also a bit of nonsense',
'BB','AA','department BB','Complete nonsense']}
)
Department
AA
AA1
BB team 1
AA but also a bit of nonsense
BB
AA
department BB
Complete nonsense
I also have a list of known departments:
known_departments = ['AA','BB']
As one can see, there are three type of departments in the list:
- Departments that are exact matches with known departments, these should stay untouched.
- Departments that are variants of the known departments. That is: it contains the department name, but some other text is present. These should be mapped to the original known departments.
- Complete nonsense departments, with no match whatsoever with the known departments, these should stay untouched as well.
Desired output
Department Department_simplified
AA AA
AA1 AA
BB team 1 BB
AA but also a bit of nonsense AA
BB BB
AA AA
department BB BB
Complete nonsense Complete nonsense
UPDATE
Thanks Chris and sophocles for your answers. While looking more elegant by using str.extract
and str.findall
, performance-wise the apply+function outperforms both on my actual df:
Solution %%timeit -n20
Chris 1.65s ± 311 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
sophocles 1.14s ± 294 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
Paul 680 ms ± 174 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)