Suppose I have a data set S that contains the service times of different jobs, S = {t1, t2, t3, ..., tn}, where ti is the service time of the i-th job and n is the total number of jobs in my data set. S is only a sample from a population; here n is 300k. I would like to study the impact of long service times, since some jobs take very long and some do not. My intuition is to study this impact based on data gathered from a real system. The system under study has billions of jobs, and this number grows by about 100 new jobs every few seconds. Service time is measured by benchmarking the jobs on a local machine, so in practice it is expensive to keep expanding the data set. Thus, I decided to randomly sample 300k jobs.
I am conducting simulation experiments where I have to generate a large number of jobs with their service times (say millions) and then do some other calculations.
To use S as the population in my simulation, I came across the following options:
1- Use S itself, i.e., resample from it directly, either with replacement (bootstrapping) or without replacement (see the first sketch after this list).
2- Fit a theoretical distribution to S and then draw samples from it (see the second sketch below).
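To make option 1 concrete, this is roughly what I have in mind; service_times here is just a placeholder for my 300k benchmarked measurements:

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Placeholder: in reality this would be my 300k benchmarked service times.
    service_times = rng.exponential(scale=5.0, size=300_000)

    n_jobs = 1_000_000  # number of simulated jobs I need

    # Sampling with replacement (bootstrapping) -- works even when the number
    # of simulated jobs exceeds the size of S.
    sim_with_replacement = rng.choice(service_times, size=n_jobs, replace=True)

    # Sampling without replacement -- only possible while the requested number
    # of jobs does not exceed len(service_times), so on its own it cannot
    # produce millions of jobs from a 300k sample.
    sim_without_replacement = rng.choice(
        service_times, size=min(n_jobs, len(service_times)), replace=False
    )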
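And for option 2, a minimal sketch of fitting a distribution and drawing from it; the lognormal is only an illustrative choice here, not necessarily the right model for my data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=42)

    # Placeholder for my 300k benchmarked service times.
    service_times = rng.exponential(scale=5.0, size=300_000)

    # Fit a candidate heavy-tailed distribution to S (lognormal as an example).
    shape, loc, scale = stats.lognorm.fit(service_times, floc=0)

    # Draw as many simulated service times as the simulation needs.
    n_jobs = 1_000_000
    sim_times = stats.lognorm.rvs(shape, loc=loc, scale=scale,
                                  size=n_jobs, random_state=rng)

    # Quick sanity check: compare sample and fitted quantiles, including the tail.
    print(np.quantile(service_times, [0.5, 0.9, 0.99]))
    print(np.quantile(sim_times, [0.5, 0.9, 0.99]))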
Am I correct? Which approach is best (pros and cons)? The first approach seems easy, since it just picks a random service time from S each time, but is it reliable? Any suggestion is appreciated, as I am not good at stats.