
I did some digging around, but I'm still very new to the concept of Latin hypercube sampling. I found this example, which uses the lhs package:

library(lhs)
set.seed(1)
randomLHS(5, 2)

           [,1]       [,2]
[1,] 0.84119491 0.89953985
[2,] 0.03531135 0.74352370
[3,] 0.33740457 0.59838122
[4,] 0.47682074 0.07600704
[5,] 0.75396828 0.35548904

From my understanding, the entries in the resulting matrix are the coordinates of 5 points that will be used to determine combinations of two continuous variables.
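
For concreteness, here is how I picture the next step, scaling those unit-cube coordinates onto actual variables (the range and the normal margin below are made-up examples, not something I've taken from a reference):

library(lhs)
set.seed(1)
X <- randomLHS(5, 2)
# each row of X is one design point in the unit square;
# transform each column onto the scale of its variable
x1 <- 10 + X[, 1] * (50 - 10)  # uniform over a made-up range [10, 50]
x2 <- qnorm(X[, 2])            # standard normal margin via the quantile function
cbind(x1, x2)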

I'm trying to do a simulation with 5 categorical variables. The number of levels per variable ranges from 2 to 5, which results in 2 x 3 x 4 x 2 x 5 = 240 scenarios. I'd like to cut that number down as much as possible, so I was thinking of using a Latin hypercube, but I'm confused about how to proceed. Any ideas would be much appreciated!
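
The only mechanical approach I've come up with so far is to bin each column of the unit-cube sample into a variable's levels, something like the sketch below (the level counts match my five variables, but the sample size is arbitrary and I have no idea whether this is a sound design):

library(lhs)
set.seed(1)
n_levels <- c(2, 3, 4, 2, 5)          # levels of my five categorical variables
X <- randomLHS(30, length(n_levels))  # 30 design points, chosen arbitrarily
# map each unit-interval column onto level indices 1..k by equal-width binning
design <- sapply(seq_along(n_levels), function(j) ceiling(X[, j] * n_levels[j]))
head(design)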

Also, do you know of any good resources that explain how to analyze the results from Latin hypercube sampling?

Maria Reyes

1 Answer


I'd recommend sticking with the full factorial with 240 design points, for the following reasons.

  1. Heck, this is what computers are for: automating tedious computational tasks. 240 design points is nothing when you're doing it on a computer! You can easily automate the process with nested loops iterating through the levels, one loop per factor (see the sketch after this list). Don't forget an innermost loop for replications. If each simulation takes more than a minute or two, break the work across multiple cores or multiple machines. One of my students recently did this for his MS thesis work and was able to run more than a million simulated experiments over a weekend.

  2. With continuous factors, you generally assume some degree of smoothness in the response surface and infer/project the response between adjacent design points via regression. With categorical factors, that inference isn't valid for excluded factor combinations, and interactions may very well be the dominant effects. Unless you run the full factorial, the combinations you omit may or may not be the most important ones; the point is that you'll never know, because you never sampled there.
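
For illustration, here is a minimal sketch of the nested-loop enumeration mentioned in point 1; expand.grid does the looping over levels for you, and runSim is a hypothetical stand-in for whatever your simulation actually does:

# enumerate all 2 x 3 x 4 x 2 x 5 = 240 factor-level combinations
scenarios <- expand.grid(f1 = 1:2, f2 = 1:3, f3 = 1:4, f4 = 1:2, f5 = 1:5)
nrow(scenarios)  # 240

n_reps <- 10  # placeholder replication count
for (i in seq_len(nrow(scenarios))) {
  for (rep in seq_len(n_reps)) {
    # result <- runSim(scenarios[i, ], rep)  # runSim is hypothetical
  }
}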

In general, you use the same analysis tools you would use for any other kind of sampling: regression, logistic regression, ANOVA, partition trees, and so on. For categorical factors, I'm a fan of partition trees.
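
As a sketch of the partition-tree idea, rpart is one standard R implementation; the data frame and response below are simulated placeholders, with one interaction deliberately injected so the tree has something to find:

library(rpart)
set.seed(1)
# fake results: one row per factor combination plus a simulated response
df <- expand.grid(f1 = factor(1:2), f2 = factor(1:3), f3 = factor(1:4),
                  f4 = factor(1:2), f5 = factor(1:5))
df$y <- rnorm(nrow(df)) + ifelse(df$f3 == "4" & df$f5 == "5", 2, 0)
fit <- rpart(y ~ ., data = df)
print(fit)  # the splits show which factors (and combinations) drive the response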

pjs
  • Thanks for your input. I want to provide a little more detail about my simulation. The 240 scenarios represent 240 populations (each with a hundred or so points). From each of these populations, I'm going to resample 100,000 times to obtain an estimate of bias and variance. Then I'm going to analyze the bias and variance across the different populations. I'm worried that my simulation might be too big, but perhaps I'll try to do the work in parallel like you suggested. Also, thank you for your explanation about analyzing the design points. – Maria Reyes Jul 04 '15 at 01:23