I have the pandas dataframe with column name as TEXT
consists strings,
TEXT
tom hardy played as bane in movie called dark knight rises.
will smith created the controversy in oscars 2023
famous movie actress emily blunt plays opposite football star messi in her next movie.
argentina has won the penalty shootout in world cup final match
lebron james is the biggest sport star as he breaks the free throw record.
survivors living in fear after consecutive earthquakes tremors.
along with this, I have 3 dictionaries (this is just an example the actual dictionaries consist hundreds of keys) as:
main_category = {'movies': ['movie', 'cinema', 'picture', 'video', 'film', 'oscars', 'cinematography', 'director', 'producer'], 'sports': ['game ', 'sport', 'athletics', 'field', 'ground', 'olympics', 'games']}
sub_category = {'actor': ['tom hardy', 'will smith', 'vin diesel', 'matt damon'], 'actress': ['emily blunt', 'sandra bullock', 'emma watson', 'mila kunis', 'brie larson', 'julia roberts'], 'football': ['goal', 'kick off', 'goalkeeper', 'penalty','messi'], 'basketball': ['block', 'bounce', 'dribble', 'drive', 'hoop', 'free throw', 'rebound']}
main_sub_category_tree = {'movies': ['actor','actress'], 'sports': ['football' ,'basketball']}
main_sub_category_tree
is the description of which sub_category
falls under which main_category
.
Now, all I want to,
i) First, assign the main_category
as column name to each TEXT
row based on the keywords matching.
ii) once main_category
got mapped, only its corresponding sub_category
will be mapped again based on further keywords matching.
iii) If no keywords found in main_category
the output will be NA
.
The final output will be,
TEXT movies sports
tom hardy played as 'bane' in movie called dark knight actor NA
rises.
will smith created the controversy in oscars 2023 actor NA
famous movie actress emily blunt plays opposite football actress football
star messi in her next movie.
argentina has won the penalty shootout in world cup final football NA
match.
lebron james is the biggest sport star as he breaks the free basketball NA
throw record.
survivors living in fear after consecutive earthquakes tremors. NA NA
The columns of dataframe will be the main_category
based on matching while column values are the `sub_cateogry'.
Hope it helps!!!