
I am trying to assess the combined uncertainty of several input parameters using a Monte Carlo method in R. In other words, using the uncertainty parameters reported in the input data documentation, I am trying to generate a distribution for each dataset by drawing 1000 random values from the distribution used (a normal or truncated normal distribution). However, I don't know how to do this with the purrr::map() functions quickly and without exhausting the RAM. The real dataset has 2 million rows and 80 columns. Here is a simplified example:

library(tidyverse); library(truncnorm); library(data.table); library(dtplyr)
n <- 1000 # number of simulations
n_obs <- 10000 # number of observations; fails with larger values, e.g. 50000

Create a data.frame

dt <- data.frame(
  var1 = runif(n_obs, 0, 100),
  var2_low = runif(n_obs, 0, 1),
  var2_mean = runif(n_obs, 0, 5),
  var2_up = runif(n_obs, 0, 10)
  )

Convert to lazy data table to speed things up

dt1 <- dt %>%
  as.data.table() %>%
  lazy_dt(immutable = FALSE)
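
One way to sanity-check the lazy pipeline before collecting it is dplyr's show_query(), which prints the data.table code dtplyr generates without running it (a minimal sketch; var1_doubled is just an illustrative column name):

dt1 %>%
  mutate(var1_doubled = var1 * 2) %>%
  show_query()
# prints the translated data.table expression instead of executing it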

Simulate

dt_sim <- dt1 %>% 
  mutate(mean_val = rep(1, nrow(.)), # just a column of 1s
         var1_rnorm = map(.x = mean_val, ~ rnorm(n, mean = .x, sd = 0.10)), # n draws per row from a normal distribution with the given sd
         sim_var1 = map2(.x = var1, .y = var1_rnorm, ~ .x * .y)) %>% # multiply the data by the simulated draws
  
  # add a truncated normal distribution for each row (var2), using that row's bounds and mean
  mutate(sim_var2 = pmap(list(var2_low, var2_up, var2_mean),
                         ~ rtruncnorm(n, a = ..1, b = ..2, mean = ..3))) %>% 
 
  # multiply the simulated variables sim_var1 and sim_var2
  mutate(sim_vars_multiplied = map2(sim_var1, sim_var2, ~ .x * .y)) %>% 
  
  # derive the coefficient of variation
  mutate(var_mean = map_dbl(sim_vars_multiplied, ~ mean(.x, na.rm = TRUE)),
         var_sd = map_dbl(sim_vars_multiplied, ~ sd(.x, na.rm = TRUE)),
         var_cv = var_sd / var_mean) %>% 
  
  # select only the variable needed
  dplyr::select(var_cv)
  

# collect the results
sim_results <- dt_sim %>% as.data.table() 
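
A rough back-of-the-envelope estimate (assuming 8-byte doubles and ignoring list overhead) shows why this list-column approach runs out of memory on the full dataset:

n <- 1000              # simulations per row
n_obs_full <- 2e6      # rows in the real dataset
bytes_per_double <- 8
n * n_obs_full * bytes_per_double / 1024^3
# ~14.9 GB for every simulated list-column

Several such columns are alive at once here (var1_rnorm, sim_var1, sim_var2, sim_vars_multiplied), so the footprint multiplies accordingly.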

1 Answer


This may help

library(data.table)
# n <- 1000 # number of simulations (not used below; see the comments)
n_obs <- 10000
dt <- data.table(
  var1 = runif(n_obs, 0, 100),
  var2_low = runif(n_obs, 0, 1),
  var2_mean = runif(n_obs, 0, 5),
  var2_up = runif(n_obs, 0, 10)
  )

dt[, mean_val := rep(1, .N)]                              # column of 1s; .N is nrow(dt)
dt[, var1_rnorm := rnorm(.N, mean = mean_val, sd = 0.10)] # one draw per row, fully vectorised
dt[, sim_var1 := var1 * var1_rnorm]

dt[, sim_var2 := truncnorm::rtruncnorm(.N, a = var2_low, b = var2_up, mean = var2_mean)]
dt[, sim_vars_multiplied := sim_var1 * sim_var2]

# note: mean/sd are computed over all rows, so var_cv is a single value recycled
dt[, var_mean := mean(sim_vars_multiplied, na.rm = TRUE)]
dt[, var_sd := sd(sim_vars_multiplied, na.rm = TRUE)]
dt[, var_cv := var_sd / var_mean]

sim_results <- dt[, var_cv]
Peace Wang
  • This works fast, but how do I specify that the number of simulations is exactly 1000 with .N? I also wonder what the usefulness of the dtplyr package is if pure data.table syntax works so much better. Thanks! – a_PhD_researcher Sep 20 '21 at 04:40
  • If this is useful, you can accept and upvote this answer :> – Peace Wang Sep 20 '21 at 04:56
  • 1. `.N` is the `nrow` of `dt`. The length of `var1_rnorm` is `10000`, so the result of `rnorm(.N, mean = mean_val, sd = 0.10)` must have the same length (a length of `1000` does not fit here; alternatively you can `rep` it `n_obs/n` times, e.g. `rep(rnorm(n, mean = mean_val, sd = 0.10), n_obs/n)`). – Peace Wang Sep 20 '21 at 05:00
  • 2. If you check the overview of dtplyr, you will find: "dtplyr provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code." – Peace Wang Sep 20 '21 at 05:00
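
If the goal from the first comment is to keep exactly n = 1000 draws per row, one option (a sketch, not from the original answer; it trades the list columns for plain matrices) is to draw an n_obs x n matrix in a single vectorised call and summarise across its rows:

library(data.table)
library(truncnorm)

n <- 1000      # simulations per row
n_obs <- 10000 # rows

dt <- data.table(
  var1 = runif(n_obs, 0, 100),
  var2_low = runif(n_obs, 0, 1),
  var2_mean = runif(n_obs, 0, 5),
  var2_up = runif(n_obs, 0, 10)
)

# Matrices fill column-wise and rtruncnorm recycles its parameter vectors,
# so row i of each matrix is simulated with that row's a/b/mean.
sim_var1 <- dt$var1 * matrix(rnorm(n_obs * n, mean = 1, sd = 0.10), nrow = n_obs)
sim_var2 <- matrix(rtruncnorm(n_obs * n, a = dt$var2_low, b = dt$var2_up,
                              mean = dt$var2_mean), nrow = n_obs)
sim_prod <- sim_var1 * sim_var2

# coefficient of variation per row, across the n simulations
dt[, var_cv := apply(sim_prod, 1, sd, na.rm = TRUE) / rowMeans(sim_prod, na.rm = TRUE)]

At 2 million rows each matrix would still be roughly 15 GB, so the full dataset would need to be processed in row chunks, keeping only var_cv from each chunk.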