I really have two questions. I am quite certain that the second one would help me solve the first one, but I might be on the wrong track altogether and there might be simpler solutions.
First question: I would like to make a stacked bar chart using ggplot2 and geom_bar. I have time series data of many countries at 4 discrete intervals (years). I know it is unorthodox to show time series data as bar charts (and I am open to alternative suggestions). What I am trying to do is to show the bar charts as facet grids (by year) where the countries are shown in the same descending order based on the sum of all of my 4 variables in all of the 4 years. I would like to show only the first 25 countries.
To do all this, I have been using a combination of dplyr pipes and ggplot.
At first, I calculated a new column with
data %>%
rowwise() %>%
mutate(total = sum(var1, var2, var3, var4, na.rm = TRUE)) %>%
arrange(desc(total)) %>%
top_n(100, total) %>%
ggplot...
but this will only show me the totals for each country-year pair and has some side effects like leaving some years blank for some countries because their values for these years did not make the top 100.
What I tried next was to use the summarize function to add up the 4 variables across all 4 years for each country, like this:
data %>%
group_by(countries) %>%
summarize(sum = sum(var1, var2, var3, var4, na.rm = TRUE))
but this reduces my table to two columns, which I know is the desired output, but I don't know how to get this new sum assigned to each respective country for all years.
I will try to reproduce both of these issues here:
Some data:
library(dplyr)
library(tidyr)
library(ggplot2)
countries <- c("country A", "country B", "country C", "country D", "country E")
years <- rep(c(2014, 2015, 2016, 2017), 5)
set.seed(123)
var1 <- sample(1:1000, 20)
var2 <- sample(1:1000, 20)
var3 <- sample(1:1000, 20)
var4 <- sample(1:1000, 20)
data <- data.frame(countries, years, var1, var2, var3, var4)
data %>%
rowwise() %>%
mutate(total = sum(var1, var2, var3, var4, na.rm = TRUE)) %>%
gather(key, value, 3:6) %>%
top_n(32, total) %>%
ggplot(., aes(x = countries, y = value, fill = key)) +
geom_col() +
facet_grid(cols = vars(years)) +
coord_flip()
As you can see, and as expected from the code, R calculated the sum for each country-year pair rather than the sum for each country across ALL years. I am seriously lost on how to proceed. Any help is appreciated!
If it makes any difference: there are lots of NAs in var3 and var4.
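For reference, here is a rough sketch of the kind of result I am after. It is only one possible approach and not a confirmed solution. It assumes a grouped mutate() to get one total per country across all years, dense_rank() to keep the top countries, and reorder() to fix the bar order in every facet. The names n_keep and plot_data are made up for illustration, and n_keep would be 25 for my real data. Countries with tied totals would share a rank, so slightly more than n_keep countries could survive the filter.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data, same construction as in the question
countries <- c("country A", "country B", "country C", "country D", "country E")
years <- rep(c(2014, 2015, 2016, 2017), 5)
set.seed(123)
var1 <- sample(1:1000, 20)
var2 <- sample(1:1000, 20)
var3 <- sample(1:1000, 20)
var4 <- sample(1:1000, 20)
data <- data.frame(countries, years, var1, var2, var3, var4)

n_keep <- 3  # hypothetical cutoff for the toy data

plot_data <- data %>%
  group_by(countries) %>%
  # one total per country over all years and all four variables;
  # na.rm = TRUE ignores missing values
  mutate(total = sum(var1, var2, var3, var4, na.rm = TRUE)) %>%
  ungroup() %>%
  # rank countries by that total (1 = largest) and keep the top ones
  mutate(rank = dense_rank(desc(total))) %>%
  filter(rank <= n_keep) %>%
  # long format so the four variables can be stacked
  pivot_longer(var1:var4, names_to = "key", values_to = "value")

# reorder() sorts countries by total; with coord_flip() the largest ends up on top
p <- ggplot(plot_data, aes(x = reorder(countries, total), y = value, fill = key)) +
  geom_col() +
  facet_grid(cols = vars(years)) +
  coord_flip()
```

Because whole countries are kept or dropped based on a single cross-year total, no country should end up with blank years in some facets.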
I forgot to illustrate the second issue:
data %>%
group_by(countries) %>%
summarize(sum = sum(var1, var2, var3, var4, na.rm = TRUE))
This returns a table with countries and sums, but how do I re-assign this new column to my original data frame?
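To make the second issue concrete, this is a minimal sketch of what I mean by re-assigning the sums. It assumes that joining the summary table back on the countries column with left_join() is a reasonable way to do it. The names country_totals and joined are only illustrative.

```r
library(dplyr)

# Toy data, same construction as in the question
countries <- c("country A", "country B", "country C", "country D", "country E")
years <- rep(c(2014, 2015, 2016, 2017), 5)
set.seed(123)
var1 <- sample(1:1000, 20)
var2 <- sample(1:1000, 20)
var3 <- sample(1:1000, 20)
var4 <- sample(1:1000, 20)
data <- data.frame(countries, years, var1, var2, var3, var4)

# One row per country with the sum over all years and all four variables
country_totals <- data %>%
  group_by(countries) %>%
  summarize(total = sum(var1, var2, var3, var4, na.rm = TRUE))

# Attach the per-country total to every country-year row of the original data
joined <- left_join(data, country_totals, by = "countries")
```

As far as I understand, replacing summarize() with mutate() after group_by(countries) should give the same column without the join, since mutate() keeps all rows and repeats the group sum on each of them.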