1

8 years and eight months ago, the first part of my question was answered perfectly for a boxplot with a missing grouping level by Stephan Kolassa when one level (12) was missing:

How to do a boxplot in R with a missing grouping level

data <- data.frame(y=rnorm(200),month=sample(c(1:11,13:15),200,replace=TRUE))
with(data,boxplot(y~factor(month,levels=1:15)))

But how should I create a dataframe not just for month as a factor but combined with a two-level factor, say age (young vs old)?

I have tried several possibilities without success.

2 Answers2

0

Let's create some data including a second factor (note that I already define the original data as ordered factors):

nn <- 1000
set.seed(1)  # for replicability
data <- data.frame(y=rnorm(nn),
    month=ordered(sample(c(1:11,13:15),nn,replace=TRUE),levels=1:15),
    age=ordered(sample(c("young","old"),nn,replace=TRUE),levels=c("young","old")))
colors <- structure(c("green","red"),.Names=levels(data$age))

In principle, R deals quite gracefully with interactions of factors (as for statistical models, and specified in the same way), so the following "works":

with(data,boxplot(y~age*month))

Unfortunately, it looks very bad. While we do have two missing boxplots, everything else is jumbled together, and the annotations on the horizontal axis are hard to understand. (Also, if we feed in the colors, the order comes out all wrong.)

The key is to call boxplot() without plotting and store the results, which contain all the information we need to plot the boxplot. Afterwards, we plot the boxplots in a nicer and more informative way.

Calculate all the information and store them without plotting:

foo <- with(data,boxplot(y~age*month, plot=FALSE))

Take a look at foo and see how it contains all the information. Specifically, note that foo$names tells us that the data is ordered such that we first have the two entries for month 1, then two for month 2 etc. (rather than first having 15 entries for young, followed by 15 for old, which we would get by specifying month*age rather than age*month, in which case we would need to adapt the subsequent script accordingly).

Here is a little magic to extract the age category for each entry, using strsplit on foo$names:

(age_category <- sapply(strsplit(foo$names,".",fixed=TRUE),"[",1))

Now, let's plot this. First, we will group the two boxplots for the two age categories in each month together and leave some space between the months. For this, we specify where we want to plot:

n_months <- max(as.numeric(as.character(data$month)))
(xx <- as.vector(rbind(3*(0:(n_months-1))+1, 3*(0:(n_months-1))+2)))

Now we plot, using the bxp() function, feeding the xx vector into the at parameter, and using our named (!) vector of colors above for the boxfill parameter. We suppress the horizontal axis.

bxp(foo, at=xx, boxfill=colors[age_category], las=1, xlab="Month", xaxt="n")

Note how the two boxplots for month=12 are missing. It should also work if some particular combination between month and age has no data. Add the horizontal axis:

axis(1,at=3*(1:n_months)-1.5,labels=1:n_months)

Finally, add a legend. Be careful not to cover any data points (the misleadingly so-called "outliers"), potentially add some vertical space using ylim in bxp():

legend("top", fill=colors, legend=names(colors))

boxplots

Alternatively, it looks like ggplot2 has some built-in support for grouped boxplots.

Stephan Kolassa
  • 7,953
  • 2
  • 28
  • 48
  • Hi Stephan, thanks for your elaborate response. This is perfect! – Frans van der Slik Dec 24 '22 at 15:22
  • Hi Stephan, I have tried to apply your codes to a sample of my own data. Unfortunately, it didn't work the way I expected. SO doesn't allow to upoad data, but perhaps the error message is helpfull already. – Frans van der Slik Dec 29 '22 at 07:55
  • I have ordered my data the following way: – Frans van der Slik Dec 29 '22 at 07:56
  • ``` df3$x <- factor(df3$x, levels = c(1976, 1979:1981,1983), ordered = TRUE) df3$age <- factor(df3$age,levels=c("young", "old"), ordered = TRUE) colors <- structure(c("green","red"),.Names=levels(df3$age)) ``` – Frans van der Slik Dec 29 '22 at 07:58
  • ` bxp(foo, at=xx, boxfill=colors[age_category], las=1, xlab="Year", xaxt="n", add = TRUE) ` _Error in bxp(foo, at = xx, boxfill = colors[age_category], las = 1, xlab = "Year", : 'at' must have same length as 'z$n', i.e. 10_ – Frans van der Slik Dec 29 '22 at 08:02
  • Hi Frans, apologies, I was quite busy the last couple of days. It may already be helpful to do `df3$x <- factor(df3$x, levels = c(1976:1983), ordered = TRUE)` (explicitly adding the missing years as possible factor levels) and `xx <- levels(df3$x)`. If that does not work, you should be able to edit your own post and include the output of `dput(df3)` right before you call `boxplot()`, which would make it easier to help you. If you decide to edit your post with this data, then please ping me in the comments, and I will try to help. – Stephan Kolassa Jan 04 '23 at 10:33
0

I'm very grateful that you want to pay attention to this issue. I have followed your suggestions without the desired outcome, however. I have a data file with three gaps, 1977, 1978, and 1982, ranging from 1977:1983, the data file contains 350 observations. This is my code:

df3 <- get(load("example.Rdata"))
df3$x <- factor(df3$x, levels = c(1976:1983), ordered = TRUE)
xx <- levels(df3$x)
df3$age <- factor(df3$age,levels=c("young", "old"), ordered = TRUE)
colors <- structure(c("green","red"),.Names=levels(df3$age))
with(df3,boxplot(y~x*age))
foo <- with(df3,boxplot(y~x*age, plot=FALSE))
(age_category <- sapply(strsplit(foo$names,".",fixed=TRUE),"[",1))
options(max.print = 1500000)
n_x <- max(as.numeric(as.character(df3$x)))
(xx <- as.vector(rbind(3*(0:(n_x-1))+1, 3*(0:(n_x-1))+2)))
head(xx)

dput(df3)
structure(list(y = c(35L, 43L, 44L, 23L, 53L, 24L, 36L, 52L, 
49L, 49L, 49L, 43L, 33L, 39L, 44L, 34L, 49L, 23L, 26L, 28L, 50L, 
37L, 30L, 43L, 45L, 43L, 39L, 35L, 20L, 28L, 53L, 52L, 44L, 55L, 
52L, 43L, 45L, 30L, 55L, 52L, 43L, 55L, 44L, 42L, 32L, 46L, 18L, 
33L, 45L, 46L, 43L, 56L, 56L, 36L, 32L, 46L, 32L, 49L, 36L, 40L, 
46L, 38L, 43L, 46L, 45L, 46L, 34L, 45L, 38L, 44L, 29L, 50L, 43L, 
55L, 43L, 41L, 44L, 25L, 45L, 42L, 30L, 45L, 32L, 42L, 49L, 33L, 
41L, 27L, 57L, 49L, 37L, 48L, 45L, 44L, 24L, 37L, 39L, 35L, 42L, 
60L, 40L, 52L, 55L, 48L, 37L, 38L, 54L, 36L, 50L, 42L, 39L, 34L, 
34L, 35L, 26L, 21L, 41L, 21L, 43L, 40L, 50L, 50L, 50L, 25L, 38L, 
48L, 34L, 46L, 59L, 44L, 51L, 38L, 37L, 43L, 45L, 52L, 53L, 42L, 
54L, 45L, 55L, 37L, 44L, 55L, 33L, 50L, 39L, 44L, 36L, 43L, 42L, 
26L, 40L, 36L, 30L, 29L, 46L, 41L, 28L, 44L, 48L, 30L, 40L, 39L, 
49L, 37L, 54L, 42L, 38L, 36L, 46L, 44L, 27L, 49L, 49L, 30L, 40L, 
21L, 51L, 58L, 53L, 40L, 37L, 56L, 36L, 51L, 36L, 57L, 51L, 41L, 
30L, 39L, 41L, 42L, 31L, 28L, 34L, 49L, 42L, 35L, 42L, 42L, 52L, 
27L, 47L, 47L, 44L, 24L, 38L, 56L, 38L, 48L, 34L, 27L, 44L, 31L, 
48L, 42L, 48L, 48L, 53L, 34L, 53L, 28L, 29L, 37L, 36L, 58L, 20L, 
51L, 31L, 29L, 47L, 36L, 42L, 37L, 42L, 45L, 55L, 32L, 48L, 39L, 
39L, 45L, 24L, 26L, 46L, 54L, 29L, 47L, 37L, 38L, 49L, 32L, 38L, 
46L, 47L, 39L, 42L, 45L, 52L, 55L, 41L, 44L, 57L, 44L, 58L, 50L, 
30L, 27L, 22L, 42L, 50L, 35L, 28L, 46L, 53L, 51L, 42L, 49L, 42L, 
58L, 52L, 39L, 51L, 50L, 52L, 43L, 42L, 38L, 43L, 46L, 38L, 36L, 
47L, 26L, 19L, 37L, 45L, 49L, 48L, 28L, 35L, 57L, 45L, 34L, 40L, 
32L, 28L, 47L, 25L, 54L, 44L, 37L, 55L, 56L, 26L, 49L, 39L, 45L, 
26L, 47L, 41L, 58L, 45L, 44L, 47L, 31L, 39L, 46L, 35L, 46L, 29L, 
40L, 40L, 48L, 19L, 39L, 35L, 30L, 38L, 42L, 46L, 48L, 25L, 28L, 
41L, 24L, 28L, 48L), x = structure(c(5L, 4L, 4L, 1L, 8L, 6L, 
4L, 1L, 1L, 6L, 4L, 4L, 4L, 8L, 1L, 5L, 6L, 8L, 4L, 6L, 6L, 4L, 
4L, 5L, 4L, 6L, 6L, 8L, 4L, 4L, 4L, 8L, 4L, 8L, 1L, 5L, 5L, 8L, 
6L, 5L, 4L, 4L, 4L, 5L, 6L, 6L, 1L, 1L, 8L, 1L, 4L, 5L, 8L, 5L, 
1L, 5L, 5L, 1L, 8L, 6L, 1L, 1L, 4L, 1L, 4L, 5L, 4L, 1L, 5L, 8L, 
5L, 8L, 1L, 5L, 5L, 1L, 5L, 1L, 5L, 6L, 5L, 8L, 6L, 1L, 4L, 1L, 
1L, 4L, 4L, 1L, 5L, 6L, 8L, 6L, 5L, 5L, 8L, 6L, 6L, 4L, 5L, 5L, 
5L, 6L, 5L, 6L, 4L, 6L, 4L, 8L, 8L, 8L, 1L, 8L, 6L, 5L, 8L, 8L, 
8L, 6L, 6L, 5L, 8L, 6L, 6L, 1L, 1L, 6L, 8L, 5L, 1L, 5L, 6L, 1L, 
1L, 1L, 4L, 8L, 6L, 5L, 6L, 1L, 5L, 6L, 4L, 8L, 4L, 6L, 6L, 1L, 
1L, 6L, 6L, 5L, 4L, 5L, 8L, 1L, 5L, 5L, 6L, 4L, 4L, 4L, 5L, 5L, 
5L, 5L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 1L, 6L, 4L, 4L, 5L, 4L, 8L, 
4L, 4L, 1L, 6L, 6L, 4L, 6L, 8L, 5L, 4L, 4L, 8L, 6L, 1L, 5L, 4L, 
8L, 1L, 8L, 1L, 5L, 8L, 4L, 5L, 6L, 1L, 5L, 5L, 1L, 8L, 1L, 6L, 
6L, 8L, 8L, 4L, 8L, 6L, 4L, 5L, 8L, 6L, 4L, 1L, 6L, 1L, 6L, 6L, 
1L, 1L, 5L, 4L, 6L, 4L, 4L, 8L, 1L, 5L, 5L, 8L, 1L, 8L, 5L, 5L, 
5L, 6L, 4L, 1L, 4L, 6L, 8L, 1L, 8L, 6L, 1L, 4L, 1L, 6L, 8L, 5L, 
4L, 5L, 4L, 1L, 6L, 8L, 5L, 1L, 4L, 6L, 4L, 5L, 1L, 6L, 1L, 1L, 
5L, 6L, 8L, 8L, 8L, 1L, 4L, 6L, 6L, 8L, 1L, 4L, 8L, 6L, 8L, 4L, 
8L, 5L, 5L, 1L, 8L, 8L, 6L, 4L, 5L, 1L, 6L, 8L, 6L, 8L, 1L, 1L, 
8L, 4L, 5L, 6L, 1L, 8L, 1L, 8L, 5L, 6L, 5L, 8L, 4L, 4L, 1L, 8L, 
4L, 6L, 8L, 8L, 4L, 1L, 4L, 4L, 8L, 5L, 5L, 1L, 5L, 4L, 1L, 4L, 
6L, 5L, 4L, 8L, 1L, 8L, 6L, 1L), .Label = c("1976", "1977", "1978", 
"1979", "1980", "1981", "1982", "1983"), class = c("ordered", 
"factor")), age = structure(c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 
2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 
2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 
2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 
2L, 1L, 1L, 2L, 2L, 1L), .Label = c("young", "old"), class = c("ordered", 
"factor"))), row.names = c(NA, -350L), class = "data.frame")`

bxp(foo, at=xx, boxfill=colors[age_category], las=1, xlab="Year", xaxt="n", add = TRUE)

Error in bxp(foo, at = xx, boxfill = colors[age_category], las = 1, xlab = "Year", : 'at' must have same length as 'z$n', i.e. 16

axis(1,at=3*(1:n_x)-1.5,labels=1:n_x)
legend("top", fill=colors, legend=names(colors))
with(df3,boxplot(y~factor(x,levels=1976:1983)* age_category))

I hope this info is helpful.