geom_bar, how to only make the x highest frequency appear?

Question

I'm working with a dataframe on state-sponsored cyberattacks (my three main variables are Date, Sponsor and Victim). I want to create a geom_bar where, for each year, only the top five victims of cyberattacks appear.

I'm not sure how to produce a reproducible example for this. I made a version where the overall top 5 victims appear, but it doesn't reflect how the targets change over the years.

cyber%>%
  filter(Sponsor_sep == "China" & 
         Victims_sep %in% c("United States", "China", "Japan", "South Korea", "India"))%>%
  ggplot() + 
  geom_bar(mapping = aes(x = Year, fill = Victims_sep))

EDIT: I followed @dandrews' comment and created a sample:

cyber <- tibble::tibble(
  Year = rep(c("2020", "2015", "2010", "2005"), c(73L, 53L, 9L, 4L)),
  Sponsor_sep = rep("China", 139L),
  Victims_sep = c(
    "Japan", "Australia", "Asia", "Australia", "Asia", "China",
    "China", "China", "United States", "United States", "China",
    "Japan", "Australia", "Australia", "Australia", "India", "Kazakhstan",
    "Kyrgyzstan", "Malaysia", "Russia", "Ukraine", "China", "United States",
    "United States", "Vietnam", "United States", "United States",
    "China", "China", "Malaysia", "Vietnam", "Asia", "China", "South Korea",
    "Myanmar", "China", "Myanmar", "United States", "China", "Vatican City",
    "China", "Vatican City", "China", "Japan", "Russia", "South Korea",
    "Japan", "Russia", "South Korea", "Japan", "Russia", "South Korea",
    "China", "International Organisations", "International Organisations",
    "Japan", "China", "United States", "United States", "United States",
    "United States", "United States", "Japan", "Russia", "South Korea",
    "International Organisations", "International Organisations",
    "Mongolia", "Mongolia", "Japan", "Asia", "Asia", "Mongolia",
    "India", "Thailand", "South Korea", "Saudi Arabia", "Malaysia",
    "United States", "Vietnam", "Cambodia", "Indonesia", "Myanmar",
    "China", "Laos", "Singapore", "Phillipines", "India", "Thailand",
    "South Korea", "Saudi Arabia", "Malaysia", "United States", "Vietnam",
    "Cambodia", "Indonesia", "Myanmar", "China", "Laos", "Singapore",
    "Phillipines", "Vietnam", "Vietnam", "Anthem", "United States",
    "United Kingdom", "China", "United States", "United Kingdom",
    "France", "United States", "United Kingdom", "France", "Thailand",
    "United States", "United States", "United States", "United States",
    "Malaysia", "Philippines", "India", "Indonesia", "United States",
    "United States", "United States", "Australia", "United States",
    "Asia", "India", "Australia", "United States", "United States",
    "International Organisations", "United States", "International Organisations",
    "United States", "United Kingdom", "United States", "United Kingdom"
  ),
)

It will be much easier to help if you post a reproducible example, with sample data for multiple years some of which contain different top 5 countries along with an annotated graph of how you expect the graph to appear. — Ian Campbell, Apr 30 '23 at 17:40
@Nea, all you need to do to make this reproducible is provide your data. You can use `dput(cyber)` then copy and paste the result. It doesn't appear too large but you can always subset the data if it is too much to copy/paste. There's going to be some data wrangling prior to plotting to achieve what you want but without the data it's too hard to assist. — dandrews, Apr 30 '23 at 21:09
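For readers unfamiliar with `dput()`, here is a minimal base-R sketch of the round trip the comment describes: `dput()` prints R code that rebuilds an object, so pasting its output into a post lets others recreate the data. The small data frame `df` is made up for illustration and is not the question's data.

```r
# Made-up sample (illustrative only)
df <- data.frame(
  Year = c("2020", "2020", "2015"),
  Victims_sep = c("Japan", "China", "India"),
  stringsAsFactors = FALSE
)

# Print code that recreates the first rows; this output is what gets pasted
dput(head(df, 2))

# Round trip: evaluating the dput() text rebuilds an equivalent object
txt <- capture.output(dput(df))
rebuilt <- eval(parse(text = txt))
```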

dandrews · Accepted Answer · 2023-05-01T19:39:09.527

OK, I found this to be an interesting brain teaser so I took a shot. I first created some data to work with, but note that because these data were drawn randomly, the resulting figure is not very interesting. However, the code seems to work, even if it is clunky.

library(tidyverse)

# Create the data and add some extra countries so the output varies
cyber <- tibble(Year=sample(seq(2005,2022,1),50000,replace = T),
                Victims_sep=sample(c("United States", "China", "Japan", "South Korea", "India",
                                     'England','Spain','Vietnam','Canada','France','Bangladesh','Taiwan','Morocco'),
                                   50000,
                                   replace = T))

# Original plot from OP but with more countries 
cyber %>% 
ggplot() + 
  geom_bar(mapping = aes(x = Year, fill = Victims_sep))

# New plot
cyber %>% 
  group_by(Year, Victims_sep) %>% 
  summarise(n = n()) %>% # get the number of attacks in each year for each country
  ungroup() %>% 
  group_by(Year) %>% 
  # get the number of attacks for the country with the most through 5th most in each year
  mutate(max_victim = max(n), 
         len = length(n),
         second = sort(n, partial = len - 1)[len - 1],
         third = sort(n, partial = len - 2)[len - 2],
         fourth = sort(n, partial = len - 3)[len - 3],
         fifth = sort(n, partial = len - 4)[len - 4]) %>% 
  rowwise() %>% 
  mutate(top5 = ifelse(n %in% max_victim:fifth, 1, 0)) %>% # flag counts between the largest and the fifth largest
  filter(top5 == 1) %>%  # keep only the flagged rows
  ggplot() + 
  geom_col(mapping = aes(x = Year, y = n, fill = Victims_sep)) # use geom_col to apply the n value
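To see what the `sort(n, partial = ...)` lines are doing: `sort(x, partial = k)` only guarantees that position `k` holds the value it would hold in a full ascending sort, so indexing at `len - 1` returns the second-largest value, `len - 4` the fifth-largest, and so on. A minimal base-R sketch on a made-up vector (the numbers are illustrative, not taken from the question's data):

```r
# Hypothetical counts for one year (illustrative values only)
n <- c(4, 9, 1, 7, 3, 8)
len <- length(n)

# Position len - 1 of an ascending sort is the second-largest value
second <- sort(n, partial = len - 1)[len - 1]

# Position len - 4 is the fifth-largest value
fifth <- sort(n, partial = len - 4)[len - 4]

# A full descending sort gives the same answers
full <- sort(n, decreasing = TRUE)
stopifnot(second == full[2], fifth == full[5])
```

When a year has fewer than five distinct victims, `len - 4` is no longer a valid position, which is the situation behind the "index 0 outside bounds" error reported in the comments.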

UPDATE USING OP'S DATA

I think this works. Note that with the data provided, a year can show more than 5 countries in its column when several countries have an equal number of attacks. For example, 2015 shows 8 countries because several of them are tied on the same count.

cyber %>% 
  mutate(Year = as.numeric(Year)) %>% # Year is stored as character in the sample data
  group_by(Year, Victims_sep) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  group_by(Year) %>% 
  # fall back to 0 when a year has fewer countries than the rank being requested
  mutate(max_victim = max(n),
         len = length(n),
         second = ifelse(len >= 2, sort(n, partial = len - 1)[len - 1], 0),
         third = ifelse(len >= 3, sort(n, partial = len - 2)[len - 2], 0),
         fourth = ifelse(len >= 4, sort(n, partial = len - 3)[len - 3], 0),
         fifth = ifelse(len >= 5, sort(n, partial = len - 4)[len - 4], 0)) %>%
  rowwise() %>% 
  mutate(top5 = ifelse(n %in% max_victim:fifth, 1, 0)) %>% 
  filter(top5 == 1) %>% 
  ggplot() + 
  geom_col(mapping = aes(x = Year, y = n, fill = Victims_sep))
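As an alternative sketch (not the approach used in this answer), dplyr's `count()` and `slice_max()` can express "top five per year" more directly, assuming dplyr 1.0.0 or later is installed. The data frame `toy` and its values are made up for illustration. `with_ties = TRUE` keeps every country tied at the cutoff, so a year can still show more than five countries; `with_ties = FALSE` would cap each year at five rows.

```r
library(dplyr)
library(ggplot2)

# Made-up data: one row per attack (illustrative only)
toy <- tibble(
  Year = rep(c(2015, 2020), c(12L, 9L)),
  Victims_sep = c(
    rep("United States", 4), rep("Japan", 3), "India", "India",
    "China", "France", "Vietnam",
    rep("China", 3), rep("Japan", 2), "India", "Russia", "Ukraine", "Asia"
  )
)

top5 <- toy %>%
  count(Year, Victims_sep) %>%                # attacks per year and country
  group_by(Year) %>%
  slice_max(n, n = 5, with_ties = TRUE) %>%   # five highest counts, ties kept
  ungroup()

# Same version with ties broken, at most five rows per year
top5_strict <- toy %>%
  count(Year, Victims_sep) %>%
  group_by(Year) %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%
  ungroup()

p <- ggplot(top5) +
  geom_col(aes(x = factor(Year), y = n, fill = Victims_sep))
p
```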

Thank you for your help! I tried your solution but I got the following error: Error in `mutate()`: ℹ In argument: `second = sort(n, partial = len - 1)[len - 1]`. ℹ In group 2: `Year = "2006"`. Caused by error in `sort.int()`: ! index 0 outside bounds. I think it comes from a problem with the df? Maybe there are not enough observations in the earlier years? Are you familiar with this type of error? — Nea, May 01 '23 at 17:17
@Nea Ah, so it looks like in your real data there are only 2 entries for that year as you suggest in your comment, and that's what the error is saying. Off the top of my head I am not sure how to program this into the above workflow, but perhaps some sort of logical statement inside the mutate call? — dandrews, May 01 '23 at 18:39
@Nea check the update and see if that does the trick. If it doesn't let me know. If you accept the answer I'll know it worked! — dandrews, May 01 '23 at 22:33
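The kind of guard discussed in these comments can also be factored into a small helper. This is a sketch only: `kth_largest()` is a hypothetical name, not part of base R or dplyr, and returning 0 for short groups mirrors the fallback used in the update to the answer.

```r
# Hypothetical helper: k-th largest value of x, or `default` when x has
# fewer than k elements, so sort() is never asked for an invalid position
kth_largest <- function(x, k, default = 0) {
  len <- length(x)
  if (len < k) {
    return(default)
  }
  pos <- len - k + 1  # position of the k-th largest in an ascending sort
  sort(x, partial = pos)[pos]
}

kth_largest(c(4, 9, 1, 7, 3, 8), 5)  # fifth-largest of six values
kth_largest(c(2, 2), 5)              # too few values, falls back to default
```

Inside the grouped `mutate()`, a call such as `fifth = kth_largest(n, 5)` could stand in for the nested `ifelse()` expression.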
