First of all, I'm completely new to pandas, so I'm not sure if this is even called a "Transformation", but I've done this and it feels like there must be a much more efficient way (either in LOC or running time) to achieve the same. This is what I've written:
def findInCol(row, col, tags):
try:
for m in validTags:
if m.lower() in row[col].lower():
tags.add(m)
else:
m_fix = pat.sub('', m.lower())
row_fix = pat.sub('', row[col].lower())
if m_fix in row_fix:
tags.add(f"{m}")
except KeyError:
pass
def getTags(row):
tags = set()
findInCol(row, 'Ad_data', tags)
findInCol(row, 'Name', tags)
tags = clean(tags)
return ','.join(tags) if tags else "NA"
def clean(tags):
arr = list(tags)
remove = set()
for i in range(len(arr)):
for j in range(i+1, len(arr)):
i_l = pat.sub('', arr[i].lower())
j_l = pat.sub('', arr[j].lower())
if i_l in j_l:
remove.add(i)
elif j_l in i_l:
remove.add(j)
arr = [i for j, i in enumerate(arr) if j not in remove]
return arr
df['Tag'] = df.fillna("NA").apply(getTags, axis=1)
If anything needs clarification please ask for it. This essentially tries to find some valid tags inside the fields 'Ad_data' y 'Name'. Many times it will find more than one matching tag, which is fine. However, it could be the case that both 'Horse' and 'RedHorse' are valid tags, so after I do the initial search, I need to clean up to keep just the more specific tag (ie. RedHorse) This is performed in the clean function.
EDIT:
Here is a sample dataset. So, going from a table with just the Name (ignore the Ad_data for now, as it may not exist in some cases), and a separate list of valid tags defined in python code (eg. just ["Horse", "RedHorse"] in this case), I need to get the following output table (with the tag added as a column):