Consider the following dataframe, (a snippet of the original):
df = pd.DataFrame({
'netflix_id': {1: 81197050, 2: 81213894, 3: 81082007},
'title': {1: 'guatemala: heart of the mayan world',2: 'the zoya factor',3: 'atlantics'},
'type': {1: 'Movie', 2: 'Movie', 3: 'Movie'},
'countries': {1: 'united states', 2: 'india', 3: 'france, senegal, belgium'},
})
output:
netflix_id | title | type | countries | |
---|---|---|---|---|
1 | 81197050 | guatemala: heart of the mayan world | Movie | united states |
2 | 81213894 | the zoya factor | Movie | india |
3 | 81082007 | atlantics | Movie | france, senegal, belgium |
Due to the size of the data, I have set some columns to pd.CategoricalDtype. df["countries"]
is one of them.
As you can see from the above example, some entries were only produced by a single country. However others, such as df.loc[3,"countries"]
have more than 1.
In these cases, I've worked around certain limitations by using some code I wrote:
df["countries"] = df["countries"].astype("category")
# returns a list of unique SINGULAR elements within a given column of dtype: category
def unique_col(dataframe: pd.DataFrame, column_to_count: str) -> list:
unique_set = set()
for i in dataframe[column_to_count].cat.categories:
if ',' in i:
lst = i.split(',')
lst_final = []
for country in lst:
if country != "":
lst_final.append(country.lstrip())
unique_set.update(lst_final)
else:
unique_set.add(i)
return list(unique_set)
unique_col(df, "countries")
Output: ['belgium', 'france', 'india', 'senegal', 'united states']
As you can see this is clunky. Whilst this works as a brute force workaround in python & pandas, I'm now stuck when it comes to visualisation. In this instance I'm working with Seaborn. For example, I can't simply do:
sns.histplot(
data = df,
y = "countries",
stat = "count",
hue = "type",
);
Obviously this will cause a huge mess with a larger dataset:
Here are my questions:
- Given my methods above, how do 'unpack' all UNIQUE & SINGULAR instances within a categorical column for the histplot above, (see the unique_col() function I wrote to see what I mean)
- Does pandas already provide functionality for unpacking categories who's strings contain multiple comma seperated elements?
- Should I be using a different dtype for this column? Specifically one that's geared more towards multiple 'tags', if that makes sense?
Other than these two questions, I'm open to criticism regarding my whole approach provided you have an alternative suggestion.
P.S:
- If anything in my post is unclear please criticise in comments & I'll update the OP
- If I have missed anything from the stackoverflow guidelines, please criticise in comments & I'll update the OP