6

Does someone know how to create a graph like the one in the screenshot? I've tried to get a similar effect adjusting alpha, but this renders outliers to be almost invisible. I know this type of graph only from a software called FlowJo, here they refer to it as "pseudocolored dot plot". Not sure if this an official term.

Screenshot from Corces et al., Nature Genetics 2016

I'd like to do it specifically in ggplot2, as I need the faceting option. I attached another screenshot of one of my graphs. The vertical lines depict clusters of mutations at certain genomic regions. Some of these clusters are much denser than others. I'd like to illustrate this using the density colors.

Rainfall Plot

The data is quite big and hard to simulate, but here's a try. I doesn't look like the actual data, but the data format is the same.

chr <- c(rep(1:10,1000))
position <- runif(10000, min=0, max=5e8)
distance <- runif(10000, min=1, max=1e5)
log10dist <- log10(distance)

df1 <- data.frame(chr, position, distance, log10dist)

ggplot(df1, aes(position, log10dist)) + 
  geom_point(shape=16, size=0.25, alpha=0.5, show.legend = FALSE) +
  facet_wrap(~chr, ncol = 5, nrow = 2, scales = "free_x")

Any help is highly appreciated.

Peer Wünsche
  • 175
  • 3
  • 8
  • Thanks for the quick response. Looks just like what I wanted. The problem is, however, that I need the faceting option of ggplot2. I'll edit the post to give a more precise example. – Peer Wünsche Aug 19 '16 at 12:45
  • That looks like a hexbin plot. See `geom_hex`. – Roland Aug 19 '16 at 12:49
  • `smoothScatter()` calls (via `grDevices:::.smoothScatterCalcDensity()`) `KernSmooth::bkde2D()`then filters out the non-outliers. You can do the density plot with `ggalt::geom_bkde2d()` and plot the `geom_point()`s underneath it. You've provided no data for anyone to mock up something. – hrbrmstr Aug 19 '16 at 12:51
  • It's not a hexbin plot. – hrbrmstr Aug 19 '16 at 12:51
  • @hrbrmstr: not sure if I understood you correctly, but the `geom_bkde2D()` doesn't look like it is giving me the result I wanted. Maybe I have to try `smoothScatter` afterall, and paste the individual chromosomes on one page to get the faceting effect. I'll also try the hexbin. Let's see. – Peer Wünsche Aug 19 '16 at 13:36

3 Answers3

6
library(ggplot2)
library(ggalt)
library(viridis)

chr <- c(rep(1:10,1000))
position <- runif(10000, min=0, max=5e8)
distance <- runif(10000, min=1, max=1e5)
log10dist <- log10(distance)

df1 <- data.frame(chr, position, distance, log10dist)

ggplot(df1, aes(position, log10dist)) + 
  geom_point(shape=16, size=0.25, show.legend = FALSE) +
  stat_bkde2d(aes(fill=..level..), geom="polygon") +
  scale_fill_viridis() +
  facet_wrap(~chr, ncol = 5, nrow = 2, scales = "free_x")

enter image description here

In practice, I'd take the initial bandwidth guess and then figure out an optimal bandwidth. Apart from taking the lazy approach and just plotting the points w/o filtering (smoothScatter() filters everything but the outliers based on npoints) this is generating the "smoothed scatterplot" like the example you posted.

smoothScatter() uses different defaults, so it comes out a bit differently:

par(mfrow=c(nr=2, nc=5))
for (chr in unique(df1$chr)) {
  plt_df <- dplyr::filter(df1, chr==chr)
  smoothScatter(df1$position, df1$log10dist, colramp=viridis)
}

enter image description here

geom_hex() is going to show the outliers, but not as distinct points:

ggplot(df1, aes(position, log10dist)) + 
  geom_point(shape=16, size=0.25, show.legend = FALSE, color="red") +
  scale_fill_viridis() +
  facet_wrap(~chr, ncol = 5, nrow = 2, scales = "free_x")

enter image description here

This:

ggplot(df1, aes(position, log10dist)) + 
  geom_point(shape=16, size=0.25) +
  stat_bkde2d(bandwidth=c(18036446, 0.05014539), 
              grid_size=c(128, 128), geom="polygon", aes(fill=..level..)) +
  scale_y_continuous(limits=c(3.5, 5.1)) +
  scale_fill_viridis() +
  facet_wrap(~chr, ncol = 5, nrow = 2, scales = "free_x") +
  theme_bw() +
  theme(panel.grid=element_blank())

enter image description here

gets you very close to the defaults smoothScatter() uses, but hackishly accomplishes most of what the nrpoints filtering code does in the smoothScatter() function solely by restricting the y axis limits.

hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
  • 1
    Wow, looks great. For my actual data, the geom_hex seems to work best. (But I think the code you posted is wrong one?!) The two other solutions are not "fine" enough and result in poor resolution. Is it possible to increase the sensitivity for the `ggalt` solution, so that the individual clusters become visible? Also: how do I change it to log scale or `..levels..`. ... I tried `..count..` but it doesn't work. – Peer Wünsche Aug 19 '16 at 14:38
1

Call me oldschool, but why not use panel.smoothScatter from package latticeExtra. It provides direct access to smoothScatter but given that it is a panel function it automatically applies it to each subset of the defined panels. You say you need "facetting" so lattice is an obvious choice as it is explicitly designed to produce small multiples (i.e. facets or, in lattice-speak, panels). Panels can easily be created with y ~ x | g, where g is the variable used to define the small multiples. For your example, this would simply be:

library(latticeExtra)

chr <- c(rep(1:10,1000))
position <- runif(10000, min=0, max=5e8)
distance <- runif(10000, min=1, max=1e5)
log10dist <- log10(distance)

df1 <- data.frame(chr, position, distance, log10dist)

clrs <- colorRampPalette(brewer.pal(9, "Reds"))

xyplot(log10dist ~ position | chr, data = df1,
       panel = panel.smoothScatter, layout = c(5, 2),
       as.table = TRUE)

This way you get full control over the smoothing function, no hacking needed.

TimSalabim
  • 5,604
  • 1
  • 25
  • 36
0

Although it might be computational intensive to generate a plot with probably millions of points, here is a solution to color each point based on its local density (i.e. a ‘pseudo-colored’ dot plot).

Generic function to calculate the local density (comparably fast).

densVals <- function(x, y = NULL, nbin = 128, bandwidth, range.x) {
  dat <- cbind(x, y)
  # limit dat to strictly finite values
  sel <- is.finite(x) & is.finite(y)
  dat.sel <- dat[sel, ]
  # density map with arbitrary graining along x and y
  map   <- grDevices:::.smoothScatterCalcDensity(dat.sel, nbin, bandwidth)
  map.x <- findInterval(dat.sel[, 1], map$x1)
  map.y <- findInterval(dat.sel[, 2], map$x2)
  # weighted mean of the fitted density map according to how close x and y are
  # to the arbitrary grain of the map
  den <- mapply(function(x, y) weighted.mean(x = c(
    map$fhat[x, y], map$fhat[x + 1, y + 1],
    map$fhat[x + 1, y], map$fhat[x, y + 1]), w = 1 / c(
    map$x1[x] + map$x2[y], map$x1[x + 1] + map$x2[y + 1],
    map$x1[x + 1] + map$x2[y], map$x1[x] + map$x2[y + 1])),
    map.x, map.y)
  # replace missing density estimates with NaN
  res <- rep(NaN, length(sel))
  res[sel] <- den
  res
}

Apply this on each point given the grouping of chromosomes.

library(dplyr)
library(ggplot2)

df1 %>% group_by(chr) %>% mutate(point_density = densVals(position, log10dist)) %>% 
  arrange(chr, point_density) %>% 
  ggplot(aes(x = position, y = log10dist, color = point_density)) +
  geom_point(size = .5) +
  scale_color_viridis_c() +
  facet_wrap(vars(chr), ncol = 5, scales = "free_x")

(pseudo-colored dot plot)

ninjaminb
  • 63
  • 1
  • 6