6

I was trying to use ggplot2 to plot the built-in anscombe data set in R (which contains four different small data sets with identical correlations but radically different relationships between X and Y). My attempts to reshape the data properly were all pretty ugly. I used a combination of reshape2 and base R; a Hadleyverse 2 (tidyr/dplyr) or a data.table solution would be fine with me, but the ideal solution would be

  • short/no repeated code
  • comprehensible (somewhat conflicting with criterion #1)
  • involve as little hard-coding of column numbers, etc. as possible

The original format:

 anscombe
 ##     x1 x2 x3 x4    y1   y2   y3     y4
 ##  1  10 10 10  8  8.04 9.14  7.46  6.58
 ##  2   8  8  8  8  6.95 8.14  6.77  5.76
 ##  3  13 13 13  8  7.58 8.74 12.74  7.71
 ## ...
 ## 11   5  5  5  8  5.68 4.74  5.73  6.89

Desired format:

 ##    s  x    y
 ## 1  1 10 8.04
 ## 2  1  8 6.95
 ## ...
 ## 44 4  8 6.89

Here's my attempt:

 library("reshape2")
 ff <- function(x,v) 
     setNames(transform(
        melt(as.matrix(x)),
             v1=substr(Var2,1,1),
             v2=substr(Var2,2,2))[,c(3,5)],
          c(v,"s"))
 f1 <- ff(anscombe[,1:4],"x")
 f2 <- ff(anscombe[,5:8],"y")
 f12 <- cbind(f1,f2)[,c("s","x","y")]

Now plot:

 library("ggplot2"); theme_set(theme_classic())
 th_clean <- 
  theme(panel.margin=grid::unit(0,"lines"),
    axis.ticks.x=element_blank(),
    axis.text.x=element_blank(),
    axis.ticks.y=element_blank(),
    axis.text.y=element_blank()
    )
ggplot(f12,aes(x,y))+geom_point()+
  facet_wrap(~s)+labs(x="",y="")+
  th_clean

enter image description here

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • 3
    you should check out the new data.table `melt`. It can combine all the ys and xs in one shot, `dat <- as.data.table(anscombe); melt(dat, measure.vars = patterns(c("x", "y")), value.name=c('x', 'y'))` – Rorschach Nov 08 '15 at 22:22
  • @TheTime, why not go ahead and post that as an answer? – Ben Bolker Nov 08 '15 at 22:24
  • I think that should do it, it might be the newest version off github that has it I cant remember – Rorschach Nov 08 '15 at 22:28
  • 2
    use a `varying` list of length two for a double melt cheeseburger `reshape(anscombe, dir = 'long', varying = list(1:4, 5:8))[, 1:3]` – rawr Nov 08 '15 at 22:41
  • @TheTime, it seems to be in 1.9.6 ... – Ben Bolker Nov 08 '15 at 23:08
  • 1
    @rawr - or specify `sep=""` again, so that reshape automatically figures everything out - `reshape(anscombe, varying=1:ncol(anscombe), sep="", direction="long", timevar="s")` – thelatemail Nov 08 '15 at 23:50
  • @thelatemail i remembered that and tried it but I wasnt doing exactly that so it wasnt working. I'll get it one of these days – rawr Nov 08 '15 at 23:59

4 Answers4

9

If you are really dealing with the "anscombe" dataset, then I would say @Thela's reshape solution is very direct.

However, here are a few other options to consider:

Option 1: Base R

You can write your own "reshape" function, perhaps something like this:

myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
  temp <- sapply(stubs, function(x) {
    unlist(indf[grep(x, names(indf))], use.names = FALSE)
  })
  s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
  data.frame(s, temp)
}

Notes:

  1. I'm not sure that this is necessarily less clunky than what you're already doing
  2. This approach will not work if the data are "unbalanced" (for example, more "x" columns than "y" columns.)

Option 2: "dplyr" + "tidyr"

Since pipes are the rage these days, you can also try:

library(dplyr)
library(tidyr)

anscombe %>%
  gather(var, val, everything()) %>%
  extract(var, into = c("variable", "s"), "(.)(.)") %>% 
  group_by(variable, s) %>%
  mutate(ind = sequence(n())) %>%
  spread(variable, val)

Notes:

  1. I'm not sure that this is necessarily less clunky than what you're already doing, but some people like the pipe approach.
  2. This approach should be able to handle unbalanced data.

Option 3: "splitstackshape"

Before @Arun went and did all that fantastic work on melt.data.table, I had written merged.stack in my "splitstackshape" package. With that, the approach would be:

library(splitstackshape)
setnames(
  merged.stack(
    data.table(anscombe, keep.rownames = TRUE), 
               var.stubs = c("x", "y"), sep = "var.stubs"), 
  ".time_1", "s")[]

A few notes:

  1. merged.stack needs something to treat as an "id", hence the need for data.table(anscombe, keep.rownames = TRUE), which adds a column named "rn" with the row numbers
  2. The sep = "var.stubs" basically means that we don't really have a separator variable, so we'll just strip out the stub and use whatever remains for the "time" variable
  3. merged.stack will work if the data are unbalanced. For instance, try using it with anscombe2 <- anscombe[1:7] as your dataset instead of "anscombe".
  4. The same package also has a function called Reshape that builds upon reshape to let it reshape unbalanced data. But it's slower and less flexible than merged.stack. The basic approach would be Reshape(data.table(anscombe, keep.rownames = TRUE), var.stubs = c("x", "y"), sep = "") and then rename the "time" variable using setnames.

Option 4: melt.data.table

This was mentioned in the comments above, but hasn't been shared as an answer. Outside of base R's reshape, this is a very direct approach that handles column renaming from within the function itself:

library(data.table)
melt(as.data.table(anscombe), 
     measure.vars = patterns(c("x", "y")), 
     value.name=c('x', 'y'), 
     variable.name = "s")

Notes:

  1. Will be insanely fast.
  2. Much better supported than "splitstackshape" or reshape ;-)
  3. Handles unbalanced data just fine.
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • wow! (I'll probably accept this answer because it's so comprehensive - although all the answers are good.) – Ben Bolker Nov 09 '15 at 03:17
  • you will always remember that it was a base solution that got you over 100k :} – rawr Nov 09 '15 at 04:41
  • After 5 years, the tidyverse has been reshaped as well :) As of today, gather and spread are being replaced by `pivot_wider` and `pivot_longer`. Instead of using `sequence(n())` one can also simply use `row_number()`. – tjebo Jul 12 '20 at 14:37
7

I think this meets the criteria of being 1) short 2) comprehensible and 3) no hardcoded column numbers. And it doesn't require any other packages.

reshape(anscombe, varying=TRUE, sep="", direction="long", timevar="s")

#     s  x     y id
#1.1  1 10  8.04  1
#...
#11.1 1  5  5.68 11
#1.2  2 10  9.14  1
#...
#11.2 2  5  4.74 11
#1.3  3 10  7.46  1
#...
#11.3 3  5  5.73 11
#1.4  4  8  6.58  1
#...
#11.4 4  8  6.89 11
thelatemail
  • 91,185
  • 12
  • 128
  • 188
3

I don't know if a non-reshape solution would be acceptable, but here you go:

library(data.table)
#create the pattern that will have the Xs
#this will make it easy to create the Ys
pattern <- 1:4
#use Map to create a list of data.frames with the needed columns
#and also use rbindlist to rbind the list produced by Map
lists <- rbindlist(Map(data.frame, 
                       pattern,
                       anscombe[pattern], 
                       anscombe[pattern+length(pattern)]
                       )
                   )
#set the correct names
setnames(lists, names(lists), c('s','x','y')) 

Output:

> lists
    s  x     y
 1: 1 10  8.04
 2: 1  8  6.95
 3: 1 13  7.58
 4: 1  9  8.81
 5: 1 11  8.33
 6: 1 14  9.96
 7: 1  6  7.24
 8: 1  4  4.26
 9: 1 12 10.84
10: 1  7  4.82
....
LyzandeR
  • 37,047
  • 12
  • 77
  • 87
  • this is nice, and I think it's only very weakly dependent on `data.table`: `rbindlist` -> `do.call(rbind,...)` and `setnames` -> `setNames` ... – Ben Bolker Nov 08 '15 at 23:10
  • Thanks @BenBolker. Yes, I just used `rbindlist` because of its speed, but as you say it can easily be converted to a base-only solution with `do.call` + `rbind`. – LyzandeR Nov 08 '15 at 23:12
2

A newer tidyverse option is suggested in the tidyverse vignette:

anscombe %>% 
  pivot_longer(everything(), 
    names_to = c(".value", "set"), 
    names_pattern = "(.)(.)"
  ) %>% 
  arrange(set)
#> # A tibble: 44 x 3
#>    set       x     y
#>    <chr> <dbl> <dbl>
#>  1 1        10  8.04
#>  2 1         8  6.95
#>  3 1        13  7.58
#>  4 1         9  8.81
#>  5 1        11  8.33
#>  6 1        14  9.96
#>  7 1         6  7.24
#>  8 1         4  4.26
#>  9 1        12 10.8 
#> 10 1         7  4.82
#> # … with 34 more rows
tjebo
  • 21,977
  • 7
  • 58
  • 94