After several hours of trial and error, I am pretty sure this is the right place to get help.
Data: approx. 10M rows
movie_title             movie_year  movie_decade  genres                         rating
Boomerang               1992        1990          Comedy|Romance                 3.5
Net, The                1995        2000          Action|Crime|Thriller          4.5
Dumb & Dumber           1994        1990          Comedy                         5
Outbreak                1995        2000          Action|Drama|Sci-Fi|Thriller   4.5
Stargate                1994        1990          Action|Adventure|Sci-Fi        3
Star Trek: Generations  1994        1990          Action|Adventure|Drama|Sci-Fi  3.5
Target:
Box plots with jitter for the top 5 groups (group_by(genres)) with the highest number of ratings
Grouping, summarizing and arranging gives the following top 5:
  genres                no_of_ratings
1 Drama                        815084
2 Comedy                       778596
3 Comedy|Romance               406061
4 Comedy|Drama                 359494
5 Comedy|Drama|Romance         290231
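For reference, the counting step behind a table like this can be sketched as follows. This is a minimal example on a made-up six-row data frame called `toy` (a hypothetical stand-in for `temp`), keeping the top 2 instead of the top 5:

```r
library(dplyr)

# Hypothetical toy stand-in for `temp`; the values are made up for illustration.
toy <- data.frame(
  genres = c("Drama", "Drama", "Drama", "Comedy", "Comedy", "Action"),
  rating = c(4, 3.5, 3, 5, 2, 4.5)
)

# Count ratings per genre, sort descending, keep the largest groups.
top_genres <- toy %>%
  group_by(genres) %>%
  summarise(no_of_ratings = n()) %>%
  arrange(desc(no_of_ratings)) %>%
  head(2)

top_genres
```

On the real data, `head(2)` would become `head(5)`.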
Similar example
In other words, I want a chart similar to this one (which shows the top 30 genres), but with box plots and jitter instead of points and error bars.
geom_point() for mean value + error bars for standard deviation
My (Non-)solution so far
temp %>%
  group_by(genres) %>%
  summarise(no_of_ratings = length(rating)) %>%
  arrange(desc(no_of_ratings)) %>%
  top_n(5, no_of_ratings) %>%
  ggplot(aes(genres, rating)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.2) +
  coord_flip()
This results in the following error, which I fully understand but cannot solve: object 'rating' not found
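As far as I can tell, the cause is that summarise() returns only the grouping column and the newly computed summary column, so rating no longer exists by the time ggplot() receives the data. A minimal sketch on a made-up data frame (`toy` is a hypothetical stand-in for `temp`) shows which columns survive:

```r
library(dplyr)

# Hypothetical toy stand-in for `temp`; the values are made up for illustration.
toy <- data.frame(
  genres = c("Drama", "Drama", "Comedy", "Comedy", "Comedy", "Action"),
  rating = c(4, 3.5, 5, 2, 3, 4.5)
)

summarised <- toy %>%
  group_by(genres) %>%
  summarise(no_of_ratings = length(rating))

# Only the grouping column and the summary column remain; `rating` is gone.
names(summarised)
```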
How can I group by genres and keep only the top n groups without losing the individual rating values?
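One direction that might work, sketched under assumptions rather than tested on the full 10M rows: compute the top genres in a separate step, then use them to filter the original rows so that rating is still available for plotting. Here `toy`, `top_genres` and `plot_data` are hypothetical names, the data is made up, and the top 2 stands in for the top 5:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy stand-in for `temp`; the values are made up for illustration.
toy <- data.frame(
  genres = c("Drama", "Drama", "Drama", "Comedy", "Comedy", "Action"),
  rating = c(4, 3.5, 3, 5, 2, 4.5)
)

# Step 1: find the genres with the most ratings (top 2 here, top 5 in the post).
top_genres <- toy %>%
  count(genres, sort = TRUE) %>%
  head(2)

# Step 2: keep every original row belonging to those genres,
# so the individual `rating` values are preserved.
plot_data <- toy %>%
  semi_join(top_genres, by = "genres")

# Step 3: box plots with jitter; outliers are hidden in the box plot layer
# because the jitter layer already draws every point.
p <- ggplot(plot_data, aes(genres, rating)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.2) +
  coord_flip()
```

With millions of rows per genre, the jitter layer would probably be slow and heavily overplotted, so plotting only a random sample of rows in that layer may be worth considering.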