I really have two questions. I am quite certain that the second one would help me solve the first one, but I might be on the wrong track altogether and there might be simpler solutions.

First question: I would like to make a stacked bar chart using ggplot2 and geom_bar. I have time series data of many countries at 4 discrete intervals (years). I know it is unorthodox to show time series data as bar charts (and I am open to alternative suggestions). What I am trying to do is to show the bar charts as facet grids (by year) where the countries are shown in the same descending order based on the sum of all of my 4 variables in all of the 4 years. I would like to show only the first 25 countries.

To do all this, I have been using a combination of dplyr pipes and ggplot.

At first, I calculated a new column with

 data %>% 
  rowwise() %>% 
  mutate(total = sum(var1, var2, var3, var4, na.rm = T)) %>% 
  arrange(desc(total)) %>% 
  top_n(100, total) %>% 
  ggplot...

but this will only show me the totals for each country-year pair and has some side effects like leaving some years blank for some countries because their values for these years did not make the top 100.

Next, I tried using the summarize function to add up the 4 variables across all 4 years, like this:

 data %>% 
  summarize(sum = sum(var1, var2, var3, var4, na.rm = T))

but this reduces my table to two columns, which I know is the desired output, but I don't know how to get this new sum assigned to each respective country for all years.

I will try to reproduce both of these issues here:

Some data:

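  library(dplyr)    # pipes, mutate, top_n, summarize
  library(tidyr)    # gather
  library(ggplot2)  # plotting

  countries <- c("country A", "country B", "country C", "country D", "country E")
  years <- rep(c(2014, 2015, 2016, 2017), 5)
  set.seed(123)
  var1 <- sample(1:1000, 20)
  var2 <- sample(1:1000, 20)
  var3 <- sample(1:1000, 20)
  var4 <- sample(1:1000, 20)
  # countries (length 5) is recycled to 20 rows, giving one row per country-year pair
  data <- data.frame(countries, years, var1, var2, var3, var4)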

  data %>% 
   rowwise() %>% 
   mutate(total = sum(var1, var2, var3, var4, na.rm = T)) %>% 
   gather(key, value, 3:6) %>% 
   top_n(32, total) %>% 
    ggplot(., aes(x = countries, y = value, fill = key)) + 
     geom_col() + 
     facet_grid(cols = vars(years)) + 
     coord_flip()

Undesired Output

As you can see, and as was expected by the code, R calculated the sum of each country-year pair, rather than the sum for each country for ALL years. I am seriously lost on how to proceed. Any help is appreciated!

If it makes any difference: lots of NAs in Var3 and Var4.

I forgot to illustrate the second issue:

  data %>% 
   group_by(countries) %>% 
   summarize(sum = sum(var1, var2, var3, var4, na.rm = T))

returns a table with countries and sums but how do I re-assign this new column to my original data frame?

Tea Tree
  • TeaTree, I had edited the question for formatting, and then your recent edit un-did the work. While I feel a little Python-esque here imposing indentation, my intent was primarily for readability, not to say that that is the indentation style everybody should use. When I have to scroll horizontally for code, it is generally a nuisance and deterrent. Not to mention that I feel it defeats one of the advantages of *literate programming* that `dplyr` verbs offer. – r2evans Jun 14 '19 at 21:58
  • So sorry about this! I saw that somebody was editing but was not aware that I would undo your work... Is there any way to reconcile changes git-style? – Tea Tree Jun 14 '19 at 22:02
  • No worries, I don't mind, it's your question, my edits are still just suggestions to the OP. But I don't feel like getting into an editing-war with somebody else (it's happened), so I just stop after the first undo. You can see my edits in the question history, if you're curious. – r2evans Jun 14 '19 at 22:03
  • OK, I implemented the indentation style. Much more readable! – Tea Tree Jun 14 '19 at 22:09
  • Nice. For your second issue, because the number and/or order of rows has changed, you'll likely need to merge it back in, perhaps with `dplyr::left_join` or base `merge` (I recommend the former). – r2evans Jun 14 '19 at 22:10
  • This actually solved my problem. As suspected, adding this new summary variable to my data set allowed me to order my ggplot as desired. I would be happy to mark it as the solution if you posted it. – Tea Tree Jun 14 '19 at 22:43
  • I'm a little tight at the moment, feel free to answer it yourself and self-accept when SO lets you (in a day or three?). Glad you solved it! – r2evans Jun 14 '19 at 22:48

1 Answer

Following r2evans, this solved the problem for me:

I first summed up all the values and saved this to a new dataset

data2 <- data %>% 
 group_by(countries) %>% 
 summarize(sum = sum(var1, var2, var3, var4, na.rm = T))

Then I left_joined the two data sets as such

 left_join(data, data2)

I could have specified by = "countries" but I didn't have to because it was the only common variable in both data sets.
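For completeness, here is a sketch of the two steps with the join key spelled out, carried through to an ordered plot on the toy data from the question. The names joined, top_countries and p are only for illustration, and the cut-off of 3 countries stands in for the 25 in the real data. It assumes dplyr 1.0.0 or later (for slice_max) and tidyr 1.0.0 or later (for pivot_longer).

  library(dplyr)
  library(tidyr)
  library(ggplot2)

  # Toy data from the question
  countries <- c("country A", "country B", "country C", "country D", "country E")
  years <- rep(c(2014, 2015, 2016, 2017), 5)
  set.seed(123)
  var1 <- sample(1:1000, 20)
  var2 <- sample(1:1000, 20)
  var3 <- sample(1:1000, 20)
  var4 <- sample(1:1000, 20)
  data <- data.frame(countries, years, var1, var2, var3, var4)

  # Step 1: one row per country, summed over all variables and all years
  data2 <- data %>%
    group_by(countries) %>%
    summarize(sum = sum(var1, var2, var3, var4, na.rm = TRUE))

  # Step 2: attach the per-country sum to every country-year row
  joined <- left_join(data, data2, by = "countries")

  # Keep only the countries with the largest overall sums (3 here, 25 in my case)
  top_countries <- data2 %>%
    slice_max(sum, n = 3) %>%
    pull(countries)

  # Order the bars by the overall sum so every facet uses the same order
  p <- joined %>%
    filter(countries %in% top_countries) %>%
    pivot_longer(var1:var4, names_to = "key", values_to = "value") %>%
    ggplot(aes(x = reorder(countries, sum), y = value, fill = key)) +
    geom_col() +
    facet_grid(cols = vars(years)) +
    coord_flip()

  print(p)

Because the filter is applied to whole countries rather than to individual country-year rows, no year is left blank for a country that makes the cut.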

While this solved the problem and I am forever grateful to r2evans, I am still wondering about a one-step solution. Please comment if you have one.
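One candidate for a one-step version, as far as I can tell, is to replace summarize with mutate after group_by: mutate keeps all rows and repeats the group-wide sum on each of them, so no join is needed. This is only a sketch on the toy data from the question, and data_with_sum is a name made up for illustration.

  library(dplyr)

  # Toy data from the question
  countries <- c("country A", "country B", "country C", "country D", "country E")
  years <- rep(c(2014, 2015, 2016, 2017), 5)
  set.seed(123)
  var1 <- sample(1:1000, 20)
  var2 <- sample(1:1000, 20)
  var3 <- sample(1:1000, 20)
  var4 <- sample(1:1000, 20)
  data <- data.frame(countries, years, var1, var2, var3, var4)

  data_with_sum <- data %>%
    group_by(countries) %>%
    # sum() sees all rows of the current country, so this is the total over all years
    mutate(sum = sum(var1, var2, var3, var4, na.rm = TRUE)) %>%
    # drop the grouping so later verbs work on the whole table again
    ungroup()

The result has the same 20 rows as data plus a sum column that is constant within each country, which should match what the left_join approach produces.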

Tea Tree