I have a data frame similar to the following, with 731 observations. Please assume that each 'Has cat X' or 'No cat X' value is a correctly formatted string.
ID  CATEGORY 1  CATEGORY 2  CATEGORY 3  CATEGORY 4  CATEGORY 5  CLASS
1   Has cat 1   Has cat 2   No cat 3    Has cat 4   No cat 5    1
2   No cat 1    No cat 2    No cat 3    Has cat 4   No cat 5    2
3   No cat 1    Has cat 2   Has cat 3   Has cat 4   Has cat 5   4
4   Has cat 1   Has cat 2   Has cat 3   Has cat 4   Has cat 5   4
5   No cat 1    Has cat 2   Has cat 3   Has cat 4   No cat 5    2
6   No cat 1    No cat 2    No cat 3    No cat 4    Has cat 5   1
7   Has cat 1   Has cat 2   No cat 3    No cat 4    No cat 5    3
8   No cat 1    No cat 2    Has cat 3   No cat 4    No cat 5    4
9   No cat 1    Has cat 2   Has cat 3   Has cat 4   No cat 5    1
10  Has cat 1   Has cat 2   Has cat 3   Has cat 4   Has cat 5   1
Each observation has a CLASS and, for each of the five categories, either "No cat X" or "Has cat X". I'd like to display an alluvial plot visualizing the flow between the five CATEGORY variables and the CLASS variable. Because the CATEGORY variables are not mutually exclusive (an observation can belong to more than one category), each observation is counted once for every category it has, so my alluvial plot looks like I have 1500+ observations.
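To make the double counting concrete, here is a minimal base R sketch using only the ten sample rows above. The names `has`, `n_obs` and `n_memberships` are placeholders made up for illustration, and the 0/1 matrix is a recoding of the "Has"/"No" strings, not the real 731-row data.

```r
# Ten sample rows from the table above; 1 = "Has cat k", 0 = "No cat k".
# (Hypothetical recoding for illustration only.)
has <- rbind(
  c(1, 1, 0, 1, 0),
  c(0, 0, 0, 1, 0),
  c(0, 1, 1, 1, 1),
  c(1, 1, 1, 1, 1),
  c(0, 1, 1, 1, 0),
  c(0, 0, 0, 0, 1),
  c(1, 1, 0, 0, 0),
  c(0, 0, 1, 0, 0),
  c(0, 1, 1, 1, 0),
  c(1, 1, 1, 1, 1)
)

n_obs <- nrow(has)          # number of observations: 10
n_memberships <- sum(has)   # number of "has" memberships: 28

# Summing category-by-class frequencies adds up memberships, not observations.
print(c(observations = n_obs, memberships = n_memberships))
```

With these ten rows the memberships total 28, which is the same kind of inflation as 731 observations showing up as 1500+.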
Table 1 - comparing categories (rows) and classes (columns)
Alluvial plot visualizing network change between categories and classes
Currently my crosstabs dataframe looks like this, with the frequencies from Table 1. This is what I use for data in my alluvial plot code.
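As a rough illustration of that layout, the sketch below builds a long frequency table from the ten sample rows in base R. The column names `category`, `class` and `frequency` are taken from the plot code; everything else (`has`, `class_id`, `crosstab`, the 0/1 recoding) is a hypothetical stand-in and may differ from how the real table was produced.

```r
# Ten sample rows; TRUE = "Has cat k". (Hypothetical recoding for illustration.)
has <- rbind(
  c(1, 1, 0, 1, 0),
  c(0, 0, 0, 1, 0),
  c(0, 1, 1, 1, 1),
  c(1, 1, 1, 1, 1),
  c(0, 1, 1, 1, 0),
  c(0, 0, 0, 0, 1),
  c(1, 1, 0, 0, 0),
  c(0, 0, 1, 0, 0),
  c(0, 1, 1, 1, 0),
  c(1, 1, 1, 1, 1)
) == 1
colnames(has) <- paste("Category", 1:5)
class_id <- c(1, 2, 4, 4, 2, 1, 3, 4, 1, 1)

# Count "has" memberships per class: rows = class, columns = category.
crosstab <- rowsum(has * 1, group = class_id)

# Reshape to long form: one row per category/class pair.
frequencies <- as.data.frame(as.table(t(crosstab)),
                             responseName = "frequency",
                             stringsAsFactors = FALSE)
names(frequencies)[1:2] <- c("category", "class")

print(frequencies)
```

Here `sum(frequencies$frequency)` is 28 for 10 observations, so any plot that uses `frequency` as `y` inherits the inflated total.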
Alluvial code:
library(ggplot2)
library(ggalluvial)

# frequencies: one row per category/class pair, with its count in `frequency`
ggplot(data = frequencies, aes(axis1 = category, axis2 = class, y = frequency)) +
  scale_x_discrete(limits = c("Category", "Class"), expand = c(.2, .05)) +
  geom_alluvium(aes(fill = category)) +                            # flows, colored by category
  geom_stratum() +                                                 # stacked boxes on each axis
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +  # label each box
  xlab("Category and Class") +
  ylab("Frequency") +
  theme_minimal() +
  ggtitle("XXX")
Is there a better way of setting up the frequency data to make the alluvial plot reflective of the true number of observations (n=731)? Or is an alluvial plot inappropriate and I should consider a different visualization method?
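One possible setup that would keep the total at the number of observations, sketched here as an assumption and not a confirmed solution: count each distinct combination of the five flags and the class once, and give every CATEGORY variable its own axis. The names `df`, `flows`, `cat1` to `cat5` and `n` are hypothetical, and the data are just the ten sample rows.

```r
library(ggplot2)
library(ggalluvial)

# Ten sample rows; 1 = "Has cat k", 0 = "No cat k". (Illustrative recoding.)
has <- rbind(
  c(1, 1, 0, 1, 0),
  c(0, 0, 0, 1, 0),
  c(0, 1, 1, 1, 1),
  c(1, 1, 1, 1, 1),
  c(0, 1, 1, 1, 0),
  c(0, 0, 0, 0, 1),
  c(1, 1, 0, 0, 0),
  c(0, 0, 1, 0, 0),
  c(0, 1, 1, 1, 0),
  c(1, 1, 1, 1, 1)
)
df <- as.data.frame(ifelse(has == 1, "Has", "No"))
names(df) <- paste0("cat", 1:5)
df$class <- factor(c(1, 2, 4, 4, 2, 1, 3, 4, 1, 1))

# One row per distinct combination of flags and class; n = observations in it.
flows <- aggregate(n ~ ., data = transform(df, n = 1), FUN = sum)

# Each observation contributes to exactly one flow, so y sums to nrow(df).
p <- ggplot(flows, aes(axis1 = cat1, axis2 = cat2, axis3 = cat3,
                       axis4 = cat4, axis5 = cat5, axis6 = class, y = n)) +
  geom_alluvium(aes(fill = class)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Cat 1", "Cat 2", "Cat 3", "Cat 4", "Cat 5", "Class")) +
  ylab("Observations") +
  theme_minimal()
print(p)
```

In this layout `sum(flows$n)` equals `nrow(df)`, so the y axis tops out at the true number of observations. Six axes may be too busy to read for the full data, in which case a different chart type might still be the better choice.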
Thanks in advance, and apologies for any formatting issues - I'm new to RStudio and Stack Overflow.