2

I have a graph where the extrapolation does not match the initial interpolation. I would like the heatmap to fill the entire image.

First, the interpolation code:

library(akima)
library(reshape2)

xmin <- signif(min(CBLo2$MD1))
xmax <- signif(max(CBLo2$MD1))
ymin <- signif(min(CBLo2$MD2)) 
ymax <- signif(max(CBLo2$MD2))
gridint <- 100

fld <- with(CBLo2, interp(x = MD1, y = MD2, z = Abundance, 
            xo=seq(xmin, xmax, length=gridint), yo=seq(ymin, ymax, length=gridint) ))
df <- melt(fld$z, na.rm = TRUE)
names(df) <- c("MD1", "MD2", "Abundance")
df$MD1 <- fld$x[df$MD1]
df$MD2 <- fld$y[df$MD2]
contour(fld) # test plot

I won't post the entire ggplot code (used for the plot below), just that needed to produce the heatmap:

ggplot() +
  geom_tile(inherit.aes=FALSE,data = df, aes(x = MD1, y = MD2,fill = Abundance)) +
  scale_fill_continuous(name = "Rain (mm)", low = "yellow", high = "green")

First plot

However, when I try to extrapolate the data (following examples from other posts), I get the following plot, which does not match the first well at all:

fld <- with(CBLo2, interp(x = MD1, y = MD2, z = Abundance, extrap=TRUE, linear=FALSE,
            xo=seq(xmin, xmax, length=gridint), yo=seq(ymin, ymax, length=gridint) ))

enter image description here

Here is the data:

Abundance   MD1 MD2
9   -0.59042    0.76793119
42  -0.48544284 -0.09465043
13  0.51250586  -0.24599322
84  -0.30857525 -0.21529624
2   0.90449257  0.679926
16  0.24536209  0.24016424
52  -0.43144002 -0.75474149
4   1.23830339  -0.11985391
37  -1.10235817 0.33886773
79  0.01757236  -0.59635386

What am I doing wrong? How can I make the extrapolation more accurate?

Grubbmeister
  • 857
  • 8
  • 25

1 Answers1

2

TLDR solution:

Add linear = FALSE to all the interp() codes for consistency, and specify the same limits in scale_fill_continuous().

Explanation:

There are two issues here.

Issue 1: the code used to generate the first fld does not include the parameter linear = FALSE, while the code used for the second does.

Let's compare the interpolated values:

library(dplyr)

fld1 <- with(CBLo2, 
            interp(x = MD1, y = MD2, z = Abundance, 
                   xo=seq(xmin, xmax, length=gridint), 
                   yo=seq(ymin, ymax, length=gridint) ))
df1 <- melt(fld1$z, na.rm = TRUE) # 6426 obs

fld2 <- with(CBLo2, 
             interp(x = MD1, y = MD2, z = Abundance, 
                    extrap = TRUE, linear = FALSE,
                    xo=seq(xmin, xmax, length=gridint), 
                    yo=seq(ymin, ymax, length=gridint) ))
df2 <- melt(fld2$z, na.rm = TRUE) #1000 obs

df.combined <- left_join(df2, df1, by = c("Var1", "Var2"))
df.combined %>% 
  filter(!is.na(value.y)) %>%        # compare for the overlapping range
  mutate(diff = value.x - value.y) %>%
  select(diff) %>% 
  summary()

      diff         
 Min.   :-303.360  
 1st Qu.: -42.399  
 Median :   8.763  
 Mean   :  -7.552  
 3rd Qu.:  36.132  
 Max.   : 238.647  

Now add linear = FALSE to the first fld:

fld3 <- with(CBLo2, 
            interp(x = MD1, y = MD2, z = Abundance, 
                   linear = FALSE,
                   xo=seq(xmin, xmax, length=gridint), 
                   yo=seq(ymin, ymax, length=gridint) ))
df3 <- melt(fld3$z, na.rm = TRUE) # 6426 obs

df.combined <- left_join(df2, df3, by = c("Var1", "Var2"))
df.combined %>% 
  filter(!is.na(value.y)) %>%
  mutate(diff = value.x - value.y) %>%
  select(diff) %>% 
  summary()

      diff  
 Min.   :0  
 1st Qu.:0  
 Median :0  
 Mean   :0  
 3rd Qu.:0  
 Max.   :0 

Issue 2: The range of interpolated values are very different.

# define column names
names(df2) <- c("MD1", "MD2", "Abundance")
names(df3) <- c("MD1", "MD2", "Abundance")

> range(df2$Abundance)
[1] -1136.341   420.369
> range(df3$Abundance)
[1] -297.9161  241.6618

We can see that even when values match at the same MD1/MD2 coordinates, the range of values in the expanded df2 far exceeds the range for df3. In order to ensure the same mapping between Abundance values and colours, we have to specify the fill limits based on the combined range of the two.

I'll use an ugly but visually distinct gradient to illustrate this:

library(gridExtra)

p <- ggplot() + 
  scale_fill_gradientn(name = "Rain (mm)", colours = rainbow(15),
                       limits = range(c(df2$Abundance, df3$Abundance)))

grid.arrange(p + geom_tile(data = df3, aes(x = MD1, y = MD2, fill = Abundance)),
             p + geom_tile(data = df2, aes(x = MD1, y = MD2, fill = Abundance)),
             nrow = 1)

plot

If we overlay the plots, they overlap exactly (transparency tweaked to show the edges for df3):

p + 
  geom_tile(data = df3, aes(x = MD1, y = MD2, fill = Abundance), alpha = 0.5) +
  geom_tile(data = df2, aes(x = MD1, y = MD2, fill = Abundance), alpha = 0.5)

plot2

Z.Lin
  • 28,055
  • 6
  • 54
  • 94
  • Thanks Z.Lin, that looks like it might help (I'll check when I get back to my computer). My only question is that I know that the first interpolation is correct, as I manually checked each data point. Including linear=FALSE appears to either to use an unsuitable interpolation method, or else skews the data by having numbers that are too high/low, and overwhelms the main pattern. – Grubbmeister Feb 02 '18 at 12:13
  • Essentially, the difference is one uses the linear method (which gives a more readable picture of my data), but you can't extrapolate the data. To do that, you need the non-linear method, but the visualisation doesn't distinguish high and low values very well, at least for my dataset. Presumably, this is because the data values are overwhelmed by non-linear interpolation values (the extremes become much greater with non-linear relative to linear methods). – Grubbmeister Mar 07 '18 at 22:38