
I need to work with 3D (spatial) data: very long tables with four columns:

x, y, z, Value

There are too many data points to plot with scatterplot3d or similar packages (rgl, lattice, ...).

I would like to reduce the number of data points.
One idea could be to sample.

But I'd like to know how to reduce the data by computing new points that summarize the nearby points.

Is there any package to do this that works with this kind of data?

Something like creating a predefined 3D grid and averaging the points that fall in each grid cell.

But I don't know whether it's better to make the new points equidistant or to obtain their coordinates by averaging the old ones locally, or even to weight each old point's contribution by its distance to the new point.

Other issues:
The "optimal" grid could be tilted, but I don't know it beforehand. I don't know if the grid should be extended a little bit beyond the data nor how much.

PS: I don't want to create surfaces or wireframes, nor fit anything.
PS: I've checked the spatial packages, but as far as I can see they are meant for data on a surface, such as the Earth, without height.

skan
    Maybe a clustering? E.g. `library(scatterplot3d); m <- replicate(3, runif(1000)); scatterplot3d(m[, 1], m[, 2], m[, 3], color="lightgray"); km <- kmeans(m, nrow(m)*.1); par(new=TRUE); scatterplot3d(km$centers[, 1], km$centers[, 2], km$centers[, 3], pch=3, color="red", cex.symbols=3, new=T)`. – lukeA Feb 17 '16 at 13:28
  • I know a little bit about the theory but haven't used any clustering package before (kmeans or hclust). How do I get clusters of the same size? How do I get clusters with the same number of points? – skan Feb 17 '16 at 19:07
  • K-means has problems with clusters of uneven size (http://www.r-bloggers.com/k-means-clustering-is-not-a-free-lunch). I could use dbscan or other methods. – skan Feb 17 '16 at 19:26

3 Answers


To reduce the size of the data set, have you thought about using a clustering method such as kmeans or hierarchical clustering (hclust)? These methods could reduce your data set to a reasonable size. Be aware that if your data set is large enough, these methods could still be too computationally expensive.
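
For illustration, a minimal sketch of that idea with a toy data set (the sizes, column names and number of clusters below are arbitrary assumptions, not taken from the question):

# reduce a cloud of points to kmeans cluster centres, averaging Value per cluster
n <- 1e5
d <- data.frame(x = runif(n), y = runif(n), z = runif(n))
d$Value <- with(d, x + y^2 + rnorm(n, 0, 0.1))   # hypothetical Value column

km <- kmeans(d[, c("x", "y", "z")], centers = 1000, iter.max = 50)
reduced <- data.frame(km$centers,
                      Value = tapply(d$Value, km$cluster, mean))
nrow(reduced)   # 1000 summary points instead of 100000

Each row of reduced is a cluster centre plus the mean Value of the points assigned to it.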

Dave2e
  • Yes, but then the result is more difficult to explain and justify to other people. And I have many points (more than 100 million), maybe too many for R to run clustering models. – skan Feb 17 '16 at 16:25
  • Who gave his answer first, you or lukeA? – skan Feb 17 '16 at 19:04

It seems like you might benefit from fitting some sort of model to your data and then displaying the prediction at a resolution of your choice.

Here is an example of fitting a GAM (generalized additive model) with mgcv:

library(sinkr) # https://github.com/marchtaylor/sinkr
library(mgcv)
library(rgl)


# make data ---------------------------------------------------------------
n <- 1000
x <- runif(n, min=-10, max=10)
y <- runif(n, min=-10, max=10)
z <- runif(n, min=-10, max=10)
value <- (-0.01*x^3 + -0.2*y^2 + -0.3*z^2) * rlnorm(n, 0, 0.1)


# fit model (GAM) ---------------------------------------------------------
fit <- gam(value ~ s(x) + s(y) + s(z))
plot.gam(fit, pages = 1)

[Figure: plot.gam output showing the fitted smooth terms for x, y and z]

This visualization is already helpful for understanding the 3D pattern of value, but you could also predict the values on a new grid. To visualize the prediction in 3D, the rgl package might be useful:

# predict to new grid -----------------------------------------------------
grd <- expand.grid(
  x = seq(min(x), max(x), length.out = 10),
  y = seq(min(y), max(y), length.out = 10),
  z = seq(min(z), max(z), length.out = 10)
)
grd$value <- predict.gam(fit, newdata = grd)

# plot prediction with rgl ------------------------------------------------
# original data
plot3d(x, y, z, col=val2col(value, col=jetPal(100)))
rgl.snapshot("original.png")

# interpolated data
plot3d(grd$x, grd$y, grd$z, col=val2col(grd$value, col=jetPal(100)), alpha=0.5, size=5)
rgl.snapshot("points.png")
spheres3d(grd$x, grd$y, grd$z, col=val2col(grd$value, col=jetPal(100)), alpha=0.3, radius=1)
rgl.snapshot("spheres.png")

[Figures: rgl snapshots of the original points, the predicted grid points, and the spheres]

Marc in the box
  • I don't like the idea of fitting a model yet, I prefer to work with raw data. Anyway, I'll have a look at your sinkr suggestion. – skan Feb 17 '16 at 16:06

I've found a way to do it.
I'll post an example, in case it's useful for others.
I use only two dimensions (and only summarize the coordinates) to keep it clear, but it can be generalized to higher dimensions and to summarizing the value at every coordinate.

set.seed(1)
xx <- runif(30, 0, 100); yy <- runif(30, 0, 100)
datos <- data.frame(xx, yy)   # sample data
plot(xx, yy, pch = 20)        # 2D plot to visualize it

n <- 4   # same number of splits on every axis (simple example)
rango  <- function(ii) (max(ii) - min(ii)) + 0.000001             # slightly enlarged range
renorm <- function(jj) trunc(n * (jj - min(jj)) / rango(jj)) + 1  # grid-cell index along one axis

# average the coordinates of the points that fall in each grid cell
result <- aggregate(cbind(xx, yy) ~ renorm(xx) + renorm(yy), datos, mean)
points(result$xx, result$yy, pch = 20, col = "red")
abline(v = min(xx) + (rango(xx) / n) * 0:n)   # draw the grid
abline(h = min(yy) + (rango(yy) / n) * 0:n)

Missing values can be handled by passing na.rm=TRUE to the aggregating function (see the sketch below).
Maybe there are simpler solutions with split, cut, dplyr, data.table, tapply, ...
I like this approach better than fixing the new points' coordinates at the center of every subregion, because if a cell contains only one point it keeps its original coordinates.
The +0.000001 added to the range is there so that the point at the maximum does not fall into an extra subregion beyond the last one.
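
For instance, a sketch of the na.rm idea (the Value column here is a made-up assumption, only for the demo):

datos$Value <- runif(nrow(datos))   # hypothetical Value column
datos$Value[3] <- NA                # inject a missing value
aggregate(cbind(xx, yy, Value) ~ renorm(xx) + renorm(yy), datos,
          function(v) mean(v, na.rm = TRUE), na.action = na.pass)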

The full solution would have been:

aggregate(cbind(xx, yy, zz, Value) ~ renorm(xx) + renorm(yy) + renorm(zz), datos, mean)

And it could be further improved by weighting distances.
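
For example, one possible weighting (a base-R sketch, not part of the solution above): within each grid cell, weight each point's Value by the inverse of its distance to the cell centroid.

# hypothetical inverse-distance weighting inside each grid cell
datos$Value <- runif(nrow(datos))   # hypothetical Value column for the demo
cell <- interaction(renorm(xx), renorm(yy), drop = TRUE)   # grid-cell id per point
idw <- function(d) {
  ctr <- c(mean(d$xx), mean(d$yy))                                # cell centroid
  w   <- 1 / (sqrt((d$xx - ctr[1])^2 + (d$yy - ctr[2])^2) + 1e-9) # inverse distances
  data.frame(xx = ctr[1], yy = ctr[2], Value = sum(w * d$Value) / sum(w))
}
result_w <- do.call(rbind, lapply(split(datos, cell), idw))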

[Figure: original points in black, grid lines, and the per-cell averages in red]

skan