I'm having trouble finding a faster way to calculate the median and mean of a large vector in R. How could I implement this more efficiently? I'm using the code below, but it's too slow. I'm thinking about parallel processing, but I have no idea how to make that work. Thanks.

    vector <- 1:10000000000
    m <- mean(vector)
    md <- median(vector)
Hysteresis

1 Answer

Assuming we're dealing with a sequential integer vector 1:n, this may help:

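## Given
V <- 1:1e9
n <- length(V)

## To get the median, index the middle element(s) of the sorted vector directly
med <- if (n %% 2 == 0) mean(V[(n/2):((n/2) + 1)]) else V[(n + 1)/2]
med
OUTPUT: 5e+08

## To get the mean, use the closed form for the sum 1 + 2 + ... + n
sum_series <- n*(n + 1) / 2
mu <- sum_series / n
mu
OUTPUT: 5e+08
# Both values are exactly 500000000.5; R prints 5e+08 at the default 7 significant digits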
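
For the original n = 10000000000, the same closed form can be evaluated from n alone, without building the vector. A minimal sketch, assuming the vector really is 1:n; `seq_mean` and `seq_median` are hypothetical helper names, not part of base R:

```r
# Hypothetical helpers: mean and median of 1:n computed from n alone,
# so no vector of length n is ever allocated.
seq_mean <- function(n) (n + 1) / 2

# The elements of 1:n are symmetric around (n + 1) / 2,
# so the median equals the mean.
seq_median <- function(n) (n + 1) / 2

n <- 1e10
seq_mean(n)
seq_median(n)
```

Both calls run in constant time, which is why the sequential case needs no parallel processing.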

For a large vector of arbitrary values, the median can be computed the same way once the vector is sorted. The mean can be estimated when no closed-form formula is available:

### Estimation via Repeated Sampling ### 
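est_mean <- function(V, k, size) {
  # k: number of sample means used in the estimate
  # size: number of elements drawn for each sample mean
  est <- rep(NA, k)
  samp <- matrix(NA, nrow = size, ncol = k)

  # Draw k samples with replacement, one per column
  for (j in seq_len(k)) samp[, j] <- sample(V, size, replace = TRUE)
  # Compute the mean of each sample
  for (j in seq_len(k)) est[j] <- mean(samp[, j])
  est <- sort(est)

  # Return the middle value (median) of the sorted sample means
  return(est[ceiling(length(est)/2)])
}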
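
The function above returns only a point estimate. As a rough sketch of how the sampling error could be quantified, the variant below draws a single random sample and reports a standard error and an approximate 95% interval. This is a different, simpler estimator than the median of sample means above, and `sample_mean_se` is a hypothetical name introduced here for illustration:

```r
# Hypothetical helper: estimate mean(V) from one random sample and
# report the standard error of that estimate.
sample_mean_se <- function(V, size) {
  samp <- sample(V, size, replace = TRUE)
  est  <- mean(samp)
  se   <- sd(samp) / sqrt(size)   # standard error of the sample mean
  list(estimate = est,
       se       = se,
       ci95     = est + c(-1.96, 1.96) * se)  # normal approximation
}

set.seed(1)
V <- runif(1e6, min = 0, max = 100)  # illustrative data, not from the question
res <- sample_mean_se(V, 10000)
res$estimate
res$ci95
mean(V)   # exact value, for comparison
```

The standard error shrinks with the square root of `size`, so quadrupling the sample size roughly halves the expected error.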

### Time Complexity of Estimation ###
# Allocate samp and est: k*size + k
#     With size and k around 30, the sample means are already roughly normally distributed
# Per iteration, draw a sample and store it: k*(size + size) = 2*k*size
# Total = k + 3*k*size, which does not depend on N = length(V)

### Time Complexity of Base R mean() ###
# Assuming mean(V) behaves like sum(V)/length(V):
# sum N items + find length + 1 division + 1 return = N + 3, i.e. linear in N


### Example ###
set.seed(0)
V <- sort(sample(0:10e8, 10e7, replace = TRUE))

start1 <- Sys.time()
est_mu <- est_mean(V, 1000, 30)
end1 <- Sys.time()
diff1 <- end1 - start1

start2 <- Sys.time()
r_mu <- mean (V)
end2 <- Sys.time()
diff2 <- end2 - start2

diff1
OUTPUT: Time difference of 0.08370018 secs
diff2
OUTPUT: Time difference of 0.5321879 secs

print(paste("Relative difference = ", abs(r_mu - est_mu)/r_mu))
OUTPUT: "Relative difference =  0.00678363793285072"
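
For the median of an unsorted vector, the index-based approach from the first part needs sorted input, but a full sort is more work than necessary. A minimal sketch, assuming base R's `sort(x, partial = ...)`, which only guarantees that the requested positions hold the values they would have after a full sort; `sorted_median` and `partial_median` are hypothetical helper names:

```r
# Hypothetical helper: median of an already sorted vector, by direct indexing.
sorted_median <- function(V_sorted) {
  n <- length(V_sorted)
  if (n %% 2 == 0) mean(V_sorted[c(n/2, n/2 + 1)]) else V_sorted[(n + 1)/2]
}

# Hypothetical helper: median of an unsorted vector using a partial sort,
# which places only the middle position(s) correctly.
partial_median <- function(V) {
  n <- length(V)
  half <- (n + 1) %/% 2
  if (n %% 2 == 1) {
    sort(V, partial = half)[half]
  } else {
    idx <- c(half, half + 1)
    mean(sort(V, partial = idx)[idx])
  }
}

x <- c(9, 1, 7, 3, 5, 11)
sorted_median(sort(x))
partial_median(x)
median(x)
```

All three calls should agree. Base R's `median()` already works along these lines, so the helpers are mainly useful for seeing what the computation involves.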
Tam Le
  • I noticed that median = mean for a sequential integer vector (i.e. 1:n, where n = 10000000000 in your case). If that always holds, you can compute just one of them (probably the mean, because it's quicker) and reuse it for the other. – Tam Le Aug 10 '18 at 23:20
  • Thanks, but, what if my vector is different from 1:n? Maybe I wasn't clear enough, but what can I do in cases of random numbers in a vector that large? – Hysteresis Aug 10 '18 at 23:56
  • The way to compute the median is still the same as above; just remember to sort your vector first. The mean is trickier, unless you can find a closed-form sum. According to this, https://rstudio-pubs-static.s3.amazonaws.com/43533_5ab8384442864c7d944fb917957da9cb.html, a parallel sum is not much better than base R's sum(). – Tam Le Aug 12 '18 at 07:31
  • I couldn't figure out how to calculate the mean both exactly and quickly, but you can estimate it quickly and quite closely. I've added a way to do that. Hope it is good enough for you. – Tam Le Aug 14 '18 at 04:46