
Suppose I have a data set S containing the service times of different jobs, S = {t1, t2, t3, ..., tn}, where ti is the service time of the ith job and n is the total number of jobs in the data set. S is only a sample from a population; here n is 300k. I would like to study the impact of long service times, since some jobs take very long and others do not, and my intuition is to base this study on data gathered from a real system. The system under study has billions of jobs, and this number grows by roughly 100 new jobs every few seconds. Also, service time is measured by benchmarking each job on a local machine, so it is practically too expensive to keep expanding the data set. I therefore decided to randomly pick 300k jobs.

I am conducting simulation experiments where I have to generate a large number of jobs with their service times (say millions) and then do some other calculations.

To use S as a population in my simulation, I came across the following options:

1- Use S itself: bootstrapping (sampling with replacement) or sampling without replacement.

2- Fit a theoretical distribution model to S and then draw from it.

Am I correct? Which approach is best (pros and cons)? The first approach seems easy, as it just picks a random service time from S each time, but is it reliable? Any suggestion is appreciated, as I am not good at stats.
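For concreteness, here is a minimal sketch of option 1 in NumPy; the array S below is a made-up stand-in for the real service times, and the sample sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up stand-in for S: 300k observed service times (seconds).
S = rng.lognormal(mean=1.0, sigma=0.8, size=300_000)

# Option 1a: bootstrap -- sample with replacement, any number of draws.
jobs_bootstrap = rng.choice(S, size=1_000_000, replace=True)

# Option 1b: sample without replacement -- capped at len(S) draws.
jobs_subsample = rng.choice(S, size=100_000, replace=False)
```

Every generated service time is one of the observed values, so no synthetic job can fall outside the range of S.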

MWH

1 Answer


Quoting from this tutorial in the 2007 Winter Simulation Conference:

At first glance, trace-driven simulation seems appealing. That is where historical data are used directly as inputs. It’s hard to argue about the validity of the distributions when real data from the real-world system is used in your model. In practice, though, this tends to be a poor solution for several reasons. Historical data may be expensive or impossible to extract. It certainly won’t be available in unlimited quantities, which significantly curtails the statistical analysis possible. Storage requirements are high. And last, but not least, it is impossible to assess “what-if?” strategies or try to simulate a prospective system, i.e., one which doesn’t yet exist.

  1. One of the major uses of simulation is to study alternative configurations or policies, and trace data is not suitable for that—it can only show you how you're currently operating. Trace data cannot be used for studying systems which are under consideration but don't yet exist.
  2. Bootstrapping resamples your existing data. This removes the data quantity limitations, but at a potential cost. Bootstrapping is premised on the assumption that your data are representative and independent. The former may not be an issue with 300k observations, but often comes up when your sample size is smaller due to cost or availability issues. The latter is a big deal if your data come from a time series where the observations are serially correlated or non-homogeneous. In that case, independent random sampling (rather than sequential playback) can lose significant information about the behaviors being studied.
  3. If sequential playback is required you're back to being limited to 300k observations, and that may not be nearly as much data as you think for statistical measures. Variance estimation is essential to calculating margins of error for confidence intervals, and serial correlation has a huge impact on the variance of a sample mean. Getting valid confidence interval estimates can take several orders of magnitude more data than is required for independent data.
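The variance inflation mentioned in point 3 can be illustrated with a toy AR(1) process, x_t = phi * x_{t-1} + e_t (all parameters here are made up): positively correlated observations make the sample mean far noisier than independent observations with the same marginal variance, so confidence intervals built as if the data were independent are much too tight.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n, reps = 0.9, 1_000, 500
marginal_sd = (1 - phi**2) ** -0.5  # stationary std of the AR(1) series

def ar1_mean(rng, phi, n):
    """Sample mean of one AR(1) path started in its stationary distribution."""
    x = rng.normal(scale=marginal_sd)
    total = 0.0
    for _ in range(n):
        x = phi * x + rng.normal()
        total += x
    return total / n

# Sample means from correlated series vs. i.i.d. series with the same marginal std.
means_corr = np.array([ar1_mean(rng, phi, n) for _ in range(reps)])
means_iid = rng.normal(scale=marginal_sd, size=(reps, n)).mean(axis=1)

# Theory predicts a std ratio of sqrt((1 + phi) / (1 - phi)), about 4.4 for phi = 0.9.
print(means_corr.std() / means_iid.std())
```

Matching that inflated variance with independent draws requires roughly (1 + phi) / (1 - phi) times as many observations, which is where the "orders of magnitude more data" comes from.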

In summary, distribution fitting takes more work up front but is usually more useful in the long run.
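As a sketch of the fitting approach with SciPy, assuming a lognormal happens to describe the service times (the real candidate family should be chosen by examining S, e.g. with Q-Q plots or goodness-of-fit tests):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Made-up observed sample; replace with the real 300k service times.
S = rng.lognormal(mean=1.0, sigma=0.8, size=300_000)

# Maximum-likelihood fit of a lognormal; floc=0 pins the location at zero.
shape, loc, scale = stats.lognorm.fit(S, floc=0)

# Draw as many synthetic service times as the simulation needs; unlike
# resampling, draws can land outside the observed range (deeper in the tail).
jobs = stats.lognorm.rvs(shape, loc=loc, scale=scale,
                         size=1_000_000, random_state=rng)
```

The fitted model can also be re-parameterized for "what-if?" studies, e.g. stretching the tail to see how the system copes with longer jobs than have been observed so far.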

pjs
  • Thanks, very useful. I added some text to explain more. For 2, I know bootstrapping assumes my data are representative, but I am not sure whether mine are, since I sampled 300k from a huge data set (billions of jobs). Do you mean bootstrapping works with small samples rather than large ones like mine? Bootstrapping is easy and generates many samples, since items are drawn with replacement. For distribution fitting, doesn't this generate jobs that may not fall within the data set, and thus the population? I am still not sure which approach is best suited to my case, even with some drawbacks. – MWH Jul 03 '19 at 20:22