
I have vectors of different sizes, and I want to sample all of them equally (for example, 10 samples from each vector) in a way that the samples represent each vector.

Suppose that one of my vectors is

y <- c(2.5, 1, 0, 1.2, 2, 3, 2, 1, 0, -2, -1, .5, 2, 3, 6, 5, 7, 9, 11, 15, 23)

What are the 10 representative points of this vector?

roan
  • What do you mean by "representative of a vector"? Very unclear to me. – Sotos Aug 14 '17 at 10:24
  • The points that are selected from the main vector, such that the shape of these 10 points is similar to the shape of the main vector. – roan Aug 14 '17 at 10:28
  • Still not clear. If you want SRS (simple random sampling), use `sample(y, 10)`. If you further want sampling with replacement, use the argument `replace = TRUE`. – Rui Barradas Aug 14 '17 at 10:31
  • The `sample` function generates randomly selected points, and I don't want that. I need the points that are important in creating the shape of the main vector, so that the shape of these new points matches that of the original vector. – roan Aug 14 '17 at 10:36
  • Then why sample at all? This does not make sense from a statistical point of view. – LAP Aug 14 '17 at 10:39
  • Yes, you are right. Thanks for your note. I have modified my question. – roan Aug 14 '17 at 10:48
  • The word `representative` has as many meanings as there are people using it; it is not a precisely defined term. So by `representative`, do you mean "which points of the vector best describe the underlying distribution"? – LAP Aug 14 '17 at 10:53
  • Please talk to a statistician. – Roland Aug 14 '17 at 10:53
  • Maybe something like this: https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm? – minem Aug 14 '17 at 10:59
  • @LAP, yes, that is it. I used 10 quantiles for that, but it had some problems. I can post my code here if necessary. – roan Aug 14 '17 at 10:59

5 Answers


In case you are referring to retaining the shape of the curve, you can try preserving the local minima and maxima:

library(dplyr)

df <- as.data.frame(y)
y2 <- df %>%
  # a point is a local minimum/maximum if both neighbours lie above/below it
  mutate(loc_minima = lag(y) > y & lead(y) > y) %>%
  mutate(loc_maxima = lag(y) < y & lead(y) < y) %>%
  filter(loc_minima | loc_maxima) %>%   # endpoints yield NA and are dropped
  select(y)

Though this does not guarantee you'll have exactly 10 points.
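For reference, a quick check of this idea on the question's vector (a base-R sketch of the same test, assuming `dplyr` is installed for `lag()`/`lead()`): only interior turning points survive, so the result can fall well short of 10 points.

y <- c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23)
# a point is kept if both neighbours lie strictly above it (local minimum)
# or strictly below it (local maximum); endpoints get NA and are dropped
mins <- dplyr::lag(y) > y & dplyr::lead(y) > y
maxs <- dplyr::lag(y) < y & dplyr::lead(y) < y
y[which(mins | maxs)]
# [1]  0  3 -2  6  5   -> only 5 shape points here, not 10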

Aramis7d

Thanks to @minem, I got my answer. Perfect!

library(kmlShape)

Px <- 1:length(y)
Py <- y

par(mfrow = c(1, 2))
plot(Px, Py, type = "l", main = "original points")
plot(DouglasPeuckerNbPoints(Px, Py, 10), type = "b", col = 2, main = "reduced points")

and the result is as below (using the Ramer–Douglas–Peucker algorithm):

[plot: sample reduction using the Douglas–Peucker algorithm]
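For readers who would rather avoid the `kmlShape` dependency, here is a minimal base-R sketch of the classic Ramer–Douglas–Peucker recursion. Two caveats: `rdp` below is a hypothetical helper written for illustration, and it prunes by a distance tolerance `eps` rather than by a target point count the way `DouglasPeuckerNbPoints` does.

# Hypothetical tolerance-based Ramer–Douglas–Peucker (illustration only).
# A point is kept whenever its perpendicular distance to the chord
# joining the current endpoints exceeds eps; otherwise it is dropped.
rdp <- function(x, y, eps) {
  n <- length(x)
  if (n <= 2) return(data.frame(x = x, y = y))
  # perpendicular distance of every point to the chord (x[1],y[1])-(x[n],y[n])
  dx <- x[n] - x[1]
  dy <- y[n] - y[1]
  d <- abs(dy * (x - x[1]) - dx * (y - y[1])) / sqrt(dx^2 + dy^2)
  i <- which.max(d)
  if (d[i] <= eps) return(data.frame(x = x[c(1, n)], y = y[c(1, n)]))
  # split at the farthest point and recurse on both halves
  left  <- rdp(x[1:i], y[1:i], eps)
  right <- rdp(x[i:n], y[i:n], eps)
  rbind(left, right[-1, ])  # drop the duplicated split point
}

rdp(seq_along(y), y, eps = 1)  # larger eps keeps fewer points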

roan

The best answer has already been given, but since I was working on it, I will post my naive heuristic solution:

Disclaimer: this is certainly less efficient and more naive than the Ramer–Douglas–Peucker algorithm, but in this case it gives a similar result...

# Try to remove iteratively one element from the vector until we reach N elements only.
# At each iteration, the reduced vector is interpolated and completed again
# using a spline, then it's compared with the original one and the
# point leading to the smallest difference is selected for the removal.
heuristicDownSample <- function(x, y, n = 10){
  idxReduced <- seq_along(x)
  while(length(idxReduced) > n){
    minDist <- NULL
    idxFinal <- NULL
    # try removing each remaining point and keep the removal that
    # changes the spline reconstruction the least
    for(idxToRemove in seq_along(idxReduced)){
      newIdxs <- idxReduced[-idxToRemove]
      spf <- splinefun(x[newIdxs], y[newIdxs])
      full <- spf(x)                # re-complete the curve by spline interpolation
      dist <- sum((full - y)^2)     # squared distance from the original curve
      if(is.null(minDist) || dist < minDist){
        minDist <- dist
        idxFinal <- newIdxs
      }
    }
    idxReduced <- idxFinal
  }
  list(x = x[idxReduced], y = y[idxReduced])
}

Usage:

y <- c(2.5, 1, 0, 1.2, 2, 3, 2, 1, 0, -2, -1, .5, 2, 3, 6, 5, 7, 9, 11, 15, 23)
x <- 1:length(y)
reduced <- heuristicDownSample(x, y, 10)

par(mfrow=c(1,2))
plot(x=x,y=y,type="b",main="original")
plot(x=reduced$x,y=reduced$y,type="b",main="reduced",col='red')

[plot: comparison of the original and reduced points]

digEmAll

Apparently you are interested in systematic sampling. If so, maybe the following can help.

set.seed(1234)                          # for reproducibility
n <- 10
step <- floor(length(y)/n)              # sampling interval (here 2)
first <- sample(step, 1)                # random start within the first interval
z <- y[step*(seq_len(n) - 1) + first]   # every step-th element from 'first'
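To make the index pattern explicit (using the numbers from the question): with `length(y) == 21` and `n == 10`, `step` is 2, so the selected positions form an arithmetic sequence, and because of the `floor()` the last element can never be chosen.

step <- floor(21/10)             # 2
step*(seq_len(10) - 1) + 1       # if first == 1: 1 3 5 7 9 11 13 15 17 19
step*(seq_len(10) - 1) + 2       # if first == 2: 2 4 6 8 10 12 14 16 18 20
# index 21 is unreachable, so the tail of the vector is slightly under-covered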
Rui Barradas

You could use `cut` to generate a factor that indicates which bin your values fall into, and then sample from there (note that cutting the range into equal-width intervals as below gives four bins rather than true quintiles; use `quantile()` for quantile-based breaks):

df <- data.frame(values = c(2.5,1,0,1.2,2,3,2,1,0,-2,-1,.5,2,3,6,5,7,9,11,15,23))

cutpoints <- seq(min(df$values), max(df$values), length.out = 5)

> cutpoints
[1] -2.00  4.25 10.50 16.75 23.00

df$quintiles <- cut(df$values, cutpoints, include.lowest = TRUE)

> df
   values   quintiles
1     2.5   [-2,4.25]
2     1.0   [-2,4.25]
3     0.0   [-2,4.25]
4     1.2   [-2,4.25]
5     2.0   [-2,4.25]
6     3.0   [-2,4.25]
7     2.0   [-2,4.25]
8     1.0   [-2,4.25]
9     0.0   [-2,4.25]
10   -2.0   [-2,4.25]
11   -1.0   [-2,4.25]
12    0.5   [-2,4.25]
13    2.0   [-2,4.25]
14    3.0   [-2,4.25]
15    6.0 (4.25,10.5]
16    5.0 (4.25,10.5]
17    7.0 (4.25,10.5]
18    9.0 (4.25,10.5]
19   11.0 (10.5,16.8]
20   15.0 (10.5,16.8]
21   23.0   (16.8,23]

Now you can split the values by bin, calculate the sampling proportions, and sample from the groups.

groups <- split(df$values, df$quintiles)   # one numeric vector per bin
probs <- prop.table(table(df$quintiles))   # share of observations per bin
nsample <- as.vector(ceiling(probs*10))    # per-bin sample sizes

> nsample
[1] 7 2 1 1

# resample() is the safe wrapper from ?sample: it avoids sample()'s
# special-casing when x has length 1
resample <- function(x, ...) x[sample.int(length(x), ...)]
mysamples <- mapply(function(x, y) resample(x = x, size = y), groups, nsample)
z <- unname(unlist(mysamples))

> z
 [1]  2.0  1.0  0.0  1.0  3.0  0.5  3.0  5.0  9.0 11.0 23.0

Due to ceiling(), this may lead to 11 cases being sampled instead of 10.
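If exactly 10 cases are required, one simple fix (a sketch that reuses the `nsample` vector from above) is to shave the allocation of the largest group until the counts sum to 10 before sampling:

n <- 10
while (sum(nsample) > n) {
  i <- which.max(nsample)      # take the overshoot out of the biggest bin
  nsample[i] <- nsample[i] - 1
}
sum(nsample)                   # now exactly 10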

LAP