Properly using pd.Categorical Dtype in cases that contain multiple entries

Question

Consider the following dataframe, (a snippet of the original):

df = pd.DataFrame({
    'netflix_id':       {1: 81197050, 2: 81213894, 3: 81082007},
    'title':            {1: 'guatemala: heart of the mayan world',2: 'the zoya factor',3: 'atlantics'},
    'type':             {1: 'Movie', 2: 'Movie', 3: 'Movie'},
    'countries':        {1: 'united states', 2: 'india', 3: 'france, senegal, belgium'},
    })

output:

	netflix_id	title	type	countries
1	81197050	guatemala: heart of the mayan world	Movie	united states
2	81213894	the zoya factor	Movie	india
3	81082007	atlantics	Movie	france, senegal, belgium

Due to the size of the data, I have set some columns to pd.CategoricalDtype. df["countries"] is one of them.

As you can see from the above example, some entries were only produced by a single country. However others, such as df.loc[3,"countries"] have more than 1.

In these cases, I've worked around certain limitations by using some code I wrote:

df["countries"] = df["countries"].astype("category")
# returns a list of unique SINGULAR elements within a given column of dtype: category 
def unique_col(dataframe: pd.DataFrame, column_to_count: str) -> list:
    unique_set = set()
    for i in dataframe[column_to_count].cat.categories:
        if ',' in i:
            lst = i.split(',')
            lst_final = []
            for country in lst:
                if country != "":
                    lst_final.append(country.lstrip())
            unique_set.update(lst_final)
        else:
            unique_set.add(i)
    return list(unique_set)

unique_col(df, "countries")

Output: ['belgium', 'france', 'india', 'senegal', 'united states']

As you can see this is clunky. Whilst this works as a brute force workaround in python & pandas, I'm now stuck when it comes to visualisation. In this instance I'm working with Seaborn. For example, I can't simply do:

sns.histplot(
    data    = df,
    y       = "countries",
    stat    = "count",
    hue     = "type",
);

Obviously this will cause a huge mess with a larger dataset:

Output

Here are my questions:

Given my methods above, how do 'unpack' all UNIQUE & SINGULAR instances within a categorical column for the histplot above, (see the unique_col() function I wrote to see what I mean)
Does pandas already provide functionality for unpacking categories who's strings contain multiple comma seperated elements?
Should I be using a different dtype for this column? Specifically one that's geared more towards multiple 'tags', if that makes sense?

Other than these two questions, I'm open to criticism regarding my whole approach provided you have an alternative suggestion.

P.S:

If anything in my post is unclear please criticise in comments & I'll update the OP
If I have missed anything from the stackoverflow guidelines, please criticise in comments & I'll update the OP

You could use pandas' `explode` to create new rows for individual countries. — JohanC, Mar 26 '23 at 15:27
`out = df.assign(countries=lambda d: d['countries'].str.split(r',\s*')).explode('countries')` — mozway, Mar 26 '23 at 15:48
@mozway Thanks. I did actually read the explode post but couldn't see how to apply it. Is there a way to do the opposite after 'explode' to return the df back to how it was? — Rykari, Mar 26 '23 at 15:56
Something like: `out.groupby('netflix_id', as_index=False).agg(', '.join)` — mozway, Mar 26 '23 at 16:12
@mozway Thanks man, yeah I saw another post on SO that mentioned using groupby.agg(.join) of some kind, I can figure it out from there. — Rykari, Mar 26 '23 at 16:15
As seaborn sometimes has problems with a repeated index, you might need to use `.... .explode(..., ignore_index=True)`. — JohanC, Mar 26 '23 at 19:04

Properly using pd.Categorical Dtype in cases that contain multiple entries

0 Answers0