0

Consider the following dataframe, (a snippet of the original):

df = pd.DataFrame({
    'netflix_id':       {1: 81197050, 2: 81213894, 3: 81082007},
    'title':            {1: 'guatemala: heart of the mayan world',2: 'the zoya factor',3: 'atlantics'},
    'type':             {1: 'Movie', 2: 'Movie', 3: 'Movie'},
    'countries':        {1: 'united states', 2: 'india', 3: 'france, senegal, belgium'},
    })

output:

netflix_id title type countries
1 81197050 guatemala: heart of the mayan world Movie united states
2 81213894 the zoya factor Movie india
3 81082007 atlantics Movie france, senegal, belgium

Due to the size of the data, I have set some columns to pd.CategoricalDtype. df["countries"] is one of them.

As you can see from the above example, some entries were only produced by a single country. However others, such as df.loc[3,"countries"] have more than 1.

In these cases, I've worked around certain limitations by using some code I wrote:

df["countries"] = df["countries"].astype("category")
# returns a list of unique SINGULAR elements within a given column of dtype: category 
def unique_col(dataframe: pd.DataFrame, column_to_count: str) -> list:
    unique_set = set()
    for i in dataframe[column_to_count].cat.categories:
        if ',' in i:
            lst = i.split(',')
            lst_final = []
            for country in lst:
                if country != "":
                    lst_final.append(country.lstrip())
            unique_set.update(lst_final)
        else:
            unique_set.add(i)
    return list(unique_set)

unique_col(df, "countries")

Output: ['belgium', 'france', 'india', 'senegal', 'united states']

As you can see this is clunky. Whilst this works as a brute force workaround in python & pandas, I'm now stuck when it comes to visualisation. In this instance I'm working with Seaborn. For example, I can't simply do:

sns.histplot(
    data    = df,
    y       = "countries",
    stat    = "count",
    hue     = "type",
);

Obviously this will cause a huge mess with a larger dataset:

Output

Here are my questions:

  1. Given my methods above, how do 'unpack' all UNIQUE & SINGULAR instances within a categorical column for the histplot above, (see the unique_col() function I wrote to see what I mean)
  2. Does pandas already provide functionality for unpacking categories who's strings contain multiple comma seperated elements?
  3. Should I be using a different dtype for this column? Specifically one that's geared more towards multiple 'tags', if that makes sense?

Other than these two questions, I'm open to criticism regarding my whole approach provided you have an alternative suggestion.

P.S:

  • If anything in my post is unclear please criticise in comments & I'll update the OP
  • If I have missed anything from the stackoverflow guidelines, please criticise in comments & I'll update the OP
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Rykari
  • 144
  • 6
  • 1
    You could use pandas' `explode` to create new rows for individual countries. – JohanC Mar 26 '23 at 15:27
  • 1
    `out = df.assign(countries=lambda d: d['countries'].str.split(r',\s*')).explode('countries')` – mozway Mar 26 '23 at 15:48
  • 1
    @mozway Thanks. I did actually read the explode post but couldn't see how to apply it. Is there a way to do the opposite after 'explode' to return the df back to how it was? – Rykari Mar 26 '23 at 15:56
  • 1
    Something like: `out.groupby('netflix_id', as_index=False).agg(', '.join)` – mozway Mar 26 '23 at 16:12
  • @mozway Thanks man, yeah I saw another post on SO that mentioned using groupby.agg(.join) of some kind, I can figure it out from there. – Rykari Mar 26 '23 at 16:15
  • As seaborn sometimes has problems with a repeated index, you might need to use `.... .explode(..., ignore_index=True)`. – JohanC Mar 26 '23 at 19:04

0 Answers0