I need to produce data containing the start time of each job (number of jobs: 30). I have no way to get real data, so how can I generate data that resembles a plausible data distribution? In this case, which distribution would be a good choice?

pjs
atco35

1 Answer

A common technique used in simulation models where you don't have any data yet (e.g., data is very expensive, or it's a prospective system that does not even exist yet so where would you get the data from?) is to use a triangular distribution parameterized by subject matter experts (or your own best guesses) about the smallest, largest, and most common value you might see.
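As a rough illustration of the idea, here is a minimal Python sketch that draws the time between successive job starts from a triangular distribution and accumulates the gaps into start times. The function name `triangular_start_times` and the values 1, 5, and 20 for smallest, most common, and largest gap are placeholders I made up; substitute your own expert guesses.

```python
import random
from itertools import accumulate

def triangular_start_times(n_jobs, low, mode, high, seed=None):
    """Return n_jobs increasing start times built from triangular gaps."""
    rng = random.Random(seed)
    # Note the argument order of random.triangular: (low, high, mode).
    gaps = [rng.triangular(low, high, mode) for _ in range(n_jobs)]
    # Running sum of the gaps gives the start time of each job.
    return list(accumulate(gaps))

# Placeholder guesses (assumed units: minutes between job starts).
starts = triangular_start_times(n_jobs=30, low=1.0, mode=5.0, high=20.0, seed=42)
print([round(t, 1) for t in starts[:5]])
```

Passing a seed makes the generated schedule reproducible, which helps when comparing solvers on the same instance.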

A relatively new, but quite powerful extension to this would be to vary the parameter choices in a designed set of experiments to see how much it matters if your guesstimates are off. A well-designed experiment would allow you to assess and characterize how much your results change as a function of the parameter values.
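One way to picture such a design is a small two-level full factorial over the three triangular parameters. The sketch below is only an assumption-laden toy: the levels are invented, and the "response" (the start time of the last job) is a stand-in for whatever output your actual model produces.

```python
import random
from itertools import product
from statistics import mean

def last_start_time(n_jobs, low, mode, high, rng):
    """Toy response: start time of the last job, i.e. the sum of all gaps."""
    return sum(rng.triangular(low, high, mode) for _ in range(n_jobs))

def run_design(levels, n_jobs=30, reps=200, seed=1):
    """Average the response at every combination of parameter levels."""
    rng = random.Random(seed)
    results = {}
    for low, mode, high in product(levels["low"], levels["mode"], levels["high"]):
        outcomes = [last_start_time(n_jobs, low, mode, high, rng) for _ in range(reps)]
        results[(low, mode, high)] = mean(outcomes)
    return results

# Hypothetical low/high guesses for each parameter (2 x 2 x 2 = 8 design points).
levels = {"low": (1.0, 2.0), "mode": (4.0, 6.0), "high": (10.0, 15.0)}
results = run_design(levels)
for point, response in sorted(results.items()):
    print(point, round(response, 1))
```

If the response barely moves across the eight design points, the exact guesses matter little; if it swings widely, that tells you which parameter deserves better information.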

A more comprehensive variant would be to incorporate the distribution choice itself (triangle vs. exponential vs. anything else you think is plausible) into the design, to see whether that makes much of a difference. In the happy event that it doesn't, you can freely use a simple and convenient distribution choice such as the triangle; if it makes a big difference, you now know for certain that you should get your hands on real data ASAP, because without that data-based knowledge you're operating in a garbage-in-garbage-out mode. This also assumes that you control for, say, the first two moments as you switch between distribution choices, so that your experiments are testing the shape of the distribution rather than the effect of its mean and variance.
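To make "control for the first two moments" concrete, here is a hedged sketch that computes the mean and variance of a triangular distribution and then picks gamma parameters with the same mean and variance. The parameter values and helper names are my own placeholders. Note that an exponential has a standard deviation equal to its mean, so it can match both moments only when the target happens to have that property.

```python
import random
from statistics import mean, pvariance

def triangular_moments(low, mode, high):
    """Mean and variance of a triangular(low, mode, high) distribution."""
    m = (low + mode + high) / 3.0
    v = (low**2 + mode**2 + high**2
         - low * mode - low * high - mode * high) / 18.0
    return m, v

def matched_gamma_params(m, v):
    """Gamma (shape, scale) with mean m and variance v."""
    return m * m / v, v / m

low, mode, high = 1.0, 5.0, 20.0  # placeholder guesses
m, v = triangular_moments(low, mode, high)
shape, scale = matched_gamma_params(m, v)

rng = random.Random(7)
tri = [rng.triangular(low, high, mode) for _ in range(50_000)]
# random.gammavariate(alpha, beta): alpha is the shape, beta is the scale.
gam = [rng.gammavariate(shape, scale) for _ in range(50_000)]

print("triangular:", round(mean(tri), 2), round(pvariance(tri), 2))
print("gamma:     ", round(mean(gam), 2), round(pvariance(gam), 2))
```

With the moments aligned like this, any difference in simulation output between the two runs can be attributed to distribution shape rather than to location or spread.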

If designed experiments tell you it doesn't much matter, that's wonderful news. If it does matter, you now know more about the system than you did before and know where to focus your efforts going forward.

pjs
  • Thanks, I will definitely check the triangular distribution. What do you think about cloning the real data in a simulator? I will check some flight delays, and I need to clone my real data, which has 20 flights. That is insufficient to test GAMS/Cplex and simulated annealing. So I am thinking of cloning the real data and increasing the number of flights using a distribution applied by the simulator, based on the limited amount of real data. What do you think about that? – atco35 Jul 01 '20 at 17:59
  • Using trace data, i.e., actual observations, is limiting in two major ways: 1) you can't simulate more data than you have, and 2) you can't simulate systems that don't yet exist. Limit #2 also applies to bootstrapping a small sample (resampling from it), which can yield biased results. For example, if you roll a 6-sided die 5 times, there is at least one value you'll never see. Even if you have more than 6 samples, it's entirely possible to get {1, 3, 1, 1, 5, 5}. Bootstrapping from those observations will miss half of the possible values and overemphasize 1's and 5's. – pjs Jul 01 '20 at 18:56
  • Actually, you are right, but my data follows a GEV distribution when I check it in the EasyFit program. If it were normally distributed, I could increase my data size by sampling from a normal distribution; however, as I understand it, preserving a GEV distribution while increasing the data size is difficult. So I am stuck: my data size is not enough, and generating all the data with, e.g., a triangular distribution is not based on any real data. – atco35 Jul 01 '20 at 19:02
  • I can guarantee you that a normal distribution won't be the right choice for start times, because its support is the entire number line, i.e., it can go negative. Whether you're generating the actual times or the time between successive starts, that's not good. – pjs Jul 01 '20 at 19:10
  • People generally model the time between start events, and that's what I was thinking you wanted when I discussed the triangle distribution. – pjs Jul 01 '20 at 19:14
  • That is a good point about the normal distribution. What would you suggest if you have data that follows a GEV distribution and you want to increase its size, e.g., to 100 flights with a GEV distribution? How would you best go about it? I am still some way from generating data that respects its underlying distribution. – atco35 Jul 01 '20 at 19:14
  • I'd say 1) make sure you're fitting the right thing (inter-event times rather than event times), and 2) look at a histogram of your data and use your common sense and judgement rather than blindly accepting an automated fitting result, particularly one based on a small sample. GEV will pop up because it's flexible, not because it necessarily makes any particular sense. – pjs Jul 01 '20 at 19:16
  • I have the arrival times of 20 flights, and I will add some flights between some of them to test my model in terms of delay-saving performance. I will follow your guidance on the distribution. Thanks for your time, stay safe. – atco35 Jul 01 '20 at 19:20
  • Now I tried taking the difference between successive times instead of putting the times themselves on the histogram, i.e., time of job 2 - time of job 1, time of job 3 - time of job 2, etc., and I got a right-skewed histogram. Now I can go on to generate the interval between two jobs and add that duration to the last job's time to get a new job that is based on the real data's inter-event times. I guess I am following a better path now. – atco35 Jul 01 '20 at 19:31
  • Right-skewed inter-event times are quite common, and exponential distributions are often used for convenience. If the events (start times) are independent of each other, as in some arrival processes, that makes sense, but bear in mind that the mode (highest density) for an exponential is zero. Gammas make more sense in many cases, and the triangle is a bounded approximation to that general shape: zeros are unlikely, and there are a lot of cases with small values plus a long right tail. In other words, right skewed... – pjs Jul 01 '20 at 19:39
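
The workflow discussed in the last few comments (difference the observed times, pick a right-skewed distribution for the gaps, then add generated gaps to the last event) could be sketched roughly as follows. This is only one possible reading: the `observed` list is invented placeholder data, not real flight times, and fitting a gamma by matching the sample mean and variance is my own simplifying assumption rather than anything prescribed in the thread.

```python
import random
from statistics import mean, variance

def inter_event_times(times):
    """Differences between successive event times (time 2 - time 1, ...)."""
    ordered = sorted(times)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def gamma_from_moments(gaps):
    """Gamma (shape, scale) matching the sample mean and variance of gaps."""
    m = mean(gaps)
    v = variance(gaps)
    return m * m / v, v / m

def extend_schedule(times, n_new, seed=None):
    """Append n_new events, each one gamma-distributed gap after the last."""
    rng = random.Random(seed)
    shape, scale = gamma_from_moments(inter_event_times(times))
    schedule = sorted(times)
    for _ in range(n_new):
        schedule.append(schedule[-1] + rng.gammavariate(shape, scale))
    return schedule

# Placeholder arrival times in minutes (made up for illustration only).
observed = [0, 3, 4, 9, 11, 12, 18, 20, 21, 29]
extended = extend_schedule(observed, n_new=5, seed=3)
print([round(t, 1) for t in extended[-5:]])
```

With a sample this small the fitted parameters are very noisy, so it is still worth eyeballing a histogram of the gaps and checking how sensitive your results are to the fitted values.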