After several hours of trial and error, I am pretty sure this is the right place to get help.

Data: approx. 10M rows

movie_title             movie_year  movie_decade  genres                         rating
Boomerang               1992        1990          Comedy|Romance                 3.5
Net, The                1995        2000          Action|Crime|Thriller          4.5
Dumb & Dumber           1994        1990          Comedy                         5
Outbreak                1995        2000          Action|Drama|Sci-Fi|Thriller   4.5
Stargate                1994        1990          Action|Adventure|Sci-Fi        3
Star Trek: Generations  1994        1990          Action|Adventure|Drama|Sci-Fi  3.5

Target: box plots with jitter for the top 5 groups (group_by(genres)) with the highest number of ratings

Grouping, summarizing and arranging gives the following top 5:

genres                           no_of_ratings
 1 Drama                            815084
 2 Comedy                           778596
 3 Comedy|Romance                   406061
 4 Comedy|Drama                     359494
 5 Comedy|Drama|Romance             290231
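For reference, a minimal sketch of how a table like this can be produced with dplyr. The tiny `temp` below is a made-up stand-in for the real 10M rows (an assumption for illustration only), and `top_genres` is a hypothetical name.

```r
library(dplyr)

# Made-up stand-in for the real data (assumption, illustration only)
temp <- tibble(
  genres = c("Drama", "Drama", "Drama", "Comedy", "Comedy", "Comedy|Romance"),
  rating = c(3.5, 4, 5, 3, 4.5, 2.5)
)

top_genres <- temp %>%
  # one row per genre with its number of ratings, largest count first
  count(genres, name = "no_of_ratings", sort = TRUE) %>%
  # keep at most the first 5 rows of that sorted table
  slice_head(n = 5)

top_genres
```

Note that the result has only the columns genres and no_of_ratings; the individual rating values are no longer part of it.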

Similar example

So, in fact, I want to create a chart similar to this one (which shows the top 30 genres), but with box plots and jitter instead of points and error bars.

geom_point() for mean value + error bars for standard deviation

My (Non-)solution so far

temp %>%
    group_by(genres) %>%
    summarise(no_of_ratings = length(rating)) %>%
    arrange(desc(no_of_ratings)) %>%
    top_n(5, no_of_ratings) %>%
    ggplot(aes(genres, rating)) +
    geom_boxplot() +
    geom_jitter(width = 0.1, alpha = 0.2) + 
    coord_flip()

This results in the following error, which I fully understand but cannot solve: object 'rating' not found

How can I group by genres and keep only the top n groups without losing the individual rating values?

dakaru
  • If I'm following, why don't you just do this in 2 steps? First, create the dataframe with total ratings by genre. Then, filter the original data to just those genres and create the plot of rating distributions. – John J. Apr 13 '23 at 16:22
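The two-step approach suggested in this comment could look roughly like the sketch below. It is not code from the thread: the small `temp` is made-up sample data, `top_genres` and `top_ratings` are hypothetical names, and n = 2 is used only because the sample is tiny (use 5 on the real data).

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in data (assumption, illustration only)
temp <- tibble(
  genres = rep(c("Drama", "Comedy", "Action", "Horror"), times = c(4, 3, 2, 1)),
  rating = c(3, 3.5, 4, 5, 2, 4, 4.5, 3, 3.5, 1)
)

# Step 1: table of the most-rated genres
top_genres <- temp %>%
  count(genres, name = "no_of_ratings", sort = TRUE) %>%
  slice_head(n = 2)

# Step 2: keep only the original rows whose genre appears in that table;
# semi_join() filters rows and leaves the rating column untouched
top_ratings <- temp %>%
  semi_join(top_genres, by = "genres")

# Box plots with jittered points; outlier.shape = NA hides the box plot's own
# outlier markers so those points are not drawn twice
p <- ggplot(top_ratings, aes(genres, rating)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.2) +
  coord_flip()

p
```

Because the filtering happens on the original rows, every single rating is still available to geom_boxplot() and geom_jitter().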

1 Answer

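You have to make sure the rating column is still in your data: summarise() collapses each genre to a single row and drops rating. Instead, you could use mutate(), which adds the count while keeping every row. Please note that this is only a small sample, so the results look different: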

library(tidyverse)
temp %>%
  group_by(genres) %>%
  mutate(no_of_ratings = length(rating)) %>%
  arrange(desc(no_of_ratings)) %>%
  top_n(5, no_of_ratings) %>%
  ggplot(aes(genres, rating)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.2) + 
  coord_flip()

Created on 2023-04-13 with reprex v2.0.2


Data used:

temp = read.table(text = "movie_title         movie_year movie_decade               genres         rating

Boomerang             1992         1990                Comedy|Romance      3.5     
Net,_The              1995         2000         Action|Crime|Thriller      4.5
Dumb_&_Dumber         1994         1990                        Comedy      5
Outbreak              1995         2000  Action|Drama|Sci-Fi|Thriller      4.5
Stargate              1994         1990       Action|Adventure|Sci-Fi      3
Star_Trek:_Generations 1994         1990 Action|Adventure|Drama|Sci-Fi      3.5", header = TRUE)
Quinten
  • Thanks Quinten. I have already tried this as well, but as you can also see in your chart, the top_n = 5 is somehow ignored. Your chart is also showing 6 genres. – dakaru Apr 13 '23 at 20:25
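A likely explanation for the behaviour described in this comment: after group_by(genres), the data is still grouped when it reaches top_n(), so top_n() picks rows within each genre instead of picking genres. All rows of a genre share the same no_of_ratings, and ties are kept, so nothing is dropped. One possible fix, sketched here on made-up sample data (an assumption; `top_ratings` is a hypothetical name), is to rank the counts on ungrouped data:

```r
library(dplyr)

# Made-up stand-in data (assumption, illustration only)
temp <- tibble(
  genres = rep(c("Drama", "Comedy", "Action", "Horror"), times = c(4, 3, 2, 1)),
  rating = c(3, 3.5, 4, 5, 2, 4, 4.5, 3, 3.5, 1)
)

top_ratings <- temp %>%
  # add the per-genre count to every row; the data stays ungrouped
  add_count(genres, name = "no_of_ratings") %>%
  # dense_rank() gives equal counts the same rank without gaps, so this keeps
  # all rows of the 2 highest counts (use 5 instead of 2 on the real data)
  filter(dense_rank(desc(no_of_ratings)) <= 2)

top_ratings
```

The result can be piped into the same ggplot() call as before. If two genres have exactly the same number of ratings, they share a rank, so more than n genres can pass the filter.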