So the real problem is that the S1.x
and S1.y
values only have one non-zero value in their columns. And it turns out that geom_density_2d
can't really estimate a density with only a value or two. But read on...
Update:
This question has been asked before, and the answers are usually that you need to have non-zero variance in your data columns. But you do have non-zero variance, so why isn't it working?
- Looking at the internals of
geom_density_2d
we see that it uses the MASS::kde2d
package function to calculate the distribution.
- Looking at
kde2d
we see that it uses MASS::bandwidth.nrd(df$x)
to get an estimate of the bandwidth.
- Looking at the help (which has the code) for
bandwidth.nrd
we see it uses a rule of thumb that gets the quantile
of the distribution, and subtracts the 2nd quantile from the 1st quantile to get a bandwidth estimate.
- Doing a quantile on your original data we see that the quantiles of the data were zero.
- And running
MASS::kde2d
on your original data with that bandwidth.nrd
estimate of the bandwidth gives you the same error:
library(MASS)
nn <- c("DQ459412","DQ459413","DQ459415","DQ459418","DQ459419","DQ459420")
s1x <- c(0,1.584963,0,0,0,0)
s1y <- c(0,2.358379,0,0,0,0)
s2x <- c(0,4.392317,0,0,4,0)
s2y <- c(0,3.085722,0,0,2.891544,0)
df <- data.frame(transcriptID=nn,S1.x=s1x,S1.y=s1y,S2.x=s2x,S2.y=s2y)
> quantile(df$s1x)
0% 25% 50% 75% 100%
0.000000 0.000000 0.000000 0.000000 1.584963
> quantile(df$s1y)
0% 25% 50% 75% 100%
0.000000 0.000000 0.000000 0.000000 2.358379
h <- c(MASS::bandwidth.nrd(df$x), MASS::bandwidth.nrd(df$y))
dens <- MASS::kde2d(df$s1x, df$s1y, h = h, n = n, lims = c(0,1,0,1))
Error in MASS::kde2d(df$s1x, df$s1y, h = h, n = n, lims = c(0, 1, 0, 1)) :
bandwidths must be strictly positive
So the real criteria for using geom_density_2D
is that both the x- and the y-data needs to have a non-zero gap between their 1st and 2nd quantiles.
Now to fix it, if I make a small modification - replacing one of the zeros with 0.1, like this:
nn <- c("DQ459412","DQ459413","DQ459415","DQ459418","DQ459419","DQ459420")
s1x <- c(0,1.584963,0,0,0.1,0)
s1y <- c(0,2.358379,0,0,0.1,0)
s2x <- c(0,4.392317,0,0,4,0)
s2y <- c(0,3.085722,0,0,2.891544,0)
df <- data.frame(transcriptID=nn,S1.x=s1x,S1.y=s1y,S2.x=s2x,S2.y=s2y)
print(df)
yielding:
transcriptID S1.x S1.y S2.x S2.y
1 DQ459412 0.000000 0.000000 0.000000 0.000000
2 DQ459413 1.584963 2.358379 4.392317 3.085722
3 DQ459415 0.000000 0.000000 0.000000 0.000000
4 DQ459418 0.000000 0.000000 0.000000 0.000000
5 DQ459419 0.100000 0.100000 4.000000 2.891544
6 DQ459420 0.000000 0.000000 0.000000 0.000000
Then I get this plot instead of your error.
You can let that 0.1
value approach zero, eventually it will not be able to calculate a distribution anymore and you will get your error again.
One general way to deal with this situation is to add a very small quantity of noise to your data, kind of simulating the fact that any meaningful calculation based on a real measurement from a continuous distribution should be impervious to that small quantity of noise.
Hope that helps.