0

I have the pandas dataframe with column name as TEXT consists strings,

TEXT
tom hardy played as bane in movie called dark knight rises.
will smith created the controversy in oscars 2023
famous movie actress emily blunt plays opposite football star messi in her next movie.
argentina has won the penalty shootout in world cup final match 
lebron james is the biggest sport star as he breaks the free throw record.
survivors living in fear after consecutive earthquakes tremors.

along with this, I have 3 dictionaries (this is just an example the actual dictionaries consist hundreds of keys) as:

main_category = {'movies': ['movie', 'cinema', 'picture', 'video', 'film', 'oscars', 'cinematography', 'director', 'producer'], 'sports': ['game ', 'sport', 'athletics', 'field', 'ground', 'olympics', 'games']}

sub_category = {'actor': ['tom hardy', 'will smith', 'vin diesel', 'matt damon'], 'actress': ['emily blunt', 'sandra bullock', 'emma watson', 'mila kunis', 'brie larson', 'julia roberts'], 'football': ['goal', 'kick off', 'goalkeeper', 'penalty','messi'], 'basketball': ['block', 'bounce', 'dribble', 'drive', 'hoop', 'free throw', 'rebound']}

main_sub_category_tree = {'movies': ['actor','actress'], 'sports': ['football' ,'basketball']}

main_sub_category_tree is the description of which sub_category falls under which main_category.

Now, all I want to,

i) First, assign the main_category as column name to each TEXT row based on the keywords matching.

ii) once main_category got mapped, only its corresponding sub_category will be mapped again based on further keywords matching.

iii) If no keywords found in main_category the output will be NA.

The final output will be,

TEXT                                                            movies     sports
tom hardy played as 'bane' in movie called dark knight          actor      NA
 rises.
will smith created the controversy in oscars 2023               actor      NA
famous movie actress emily blunt plays opposite football        actress    football
star messi in her next movie.                                                                  
argentina has won the penalty shootout in world cup final       football   NA
match.
lebron james is the biggest sport star as he breaks the free    basketball NA
throw record.
survivors living in fear after consecutive earthquakes tremors.  NA        NA

The columns of dataframe will be the main_category based on matching while column values are the `sub_cateogry'.

Hope it helps!!!

Learner
  • 800
  • 1
  • 8
  • 23

0 Answers0