0

I am trying to use a maximum likelihood estimator to check for the existence of a power law in a certain synthetic dataset, following the approach described in this paper. In this approach, a vector of observations x is fed to the code, which then returns the confidence level (p-value) with which the data could have come from a power-law distribution. For a single dataset this is straightforward.

However, I am now trying to use the same code in a slightly different situation. I run many (say 100) random simulations of a certain process, each of which returns a vector x of length 1000. I then average the distributions of all 100 realizations to obtain an averaged distribution that looks roughly straight on a log-log plot. To find the p-value with the above code, I must feed it a vector of observations corresponding to this averaged distribution, and here I run into a problem. Initially I simply multiplied the averaged distribution by 1000 and took the nearest integer of the product as the frequency of each value. But some values occur in very few of the 100 realizations, so their rounded frequencies are zero and they do not appear at all in the reconstructed vector. Thus I lose all the values in the tail of the distribution.

Is there a better way to calculate a p-value from such an averaged distribution to test the power-law hypothesis?
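To make the tail-loss concrete, here is a minimal sketch with hypothetical numbers (not from the actual simulations): a tail value that appears once in only 3 of the 100 realizations has an average count of 0.03, which rounds to zero, while pooling the raw runs keeps it.

```python
import numpy as np

# Hypothetical illustration: a rare tail value that appears once in 3 of
# the 100 realizations and never in the other 97.
counts = np.zeros(100, dtype=int)
counts[:3] = 1

avg_count = counts.mean()        # 0.03 occurrences per realization on average
rounded = int(round(avg_count))  # nearest integer is 0 -> the value vanishes

pooled = counts.sum()            # 3 -> the value survives if raw runs are pooled
print(rounded, pooled)           # prints: 0 3
```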

Peaceful
  • 4,920
  • 15
  • 54
  • 79

1 Answer

0

So, to summarise, you're trying to get the best fit from a hundred realisations of data? Since the data are simulated, I imagine the noise is the same across all simulations and you have the same number of points in each, so each realisation carries the same weight. In that case, lump them all together. Calculate a y for each x based on the parameters m and c (assuming you're converting the power law to a straight line y = mx + c on log-log axes) and ask, based on the noise in the sample, what the probability of each simulated value is. Multiply these probabilities together over all x, then repeat for different values of m and c (you might want to look at Gibbs sampling). Finally, use the values of m and c that give you the highest probability.
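As a sketch of the "lump them all together" idea: instead of averaging distributions, concatenate all 100 raw samples and fit the exponent by maximum likelihood directly. This uses the standard continuous power-law MLE, alpha_hat = 1 + n / sum(ln(x_i / xmin)), and assumes continuous data with a known xmin (the synthetic data below is illustrative, not the asker's process).

```python
import numpy as np

def powerlaw_mle_alpha(x, xmin):
    """Continuous power-law MLE for the exponent:
    alpha_hat = 1 + n / sum(ln(x_i / xmin)) over all x_i >= xmin."""
    x = np.asarray(x, dtype=float)
    tail = x[x >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

# Pool ("lump") all 100 realizations into one sample instead of averaging.
rng = np.random.default_rng(42)
alpha_true, xmin = 2.5, 1.0
# Inverse-CDF sampling of a continuous power law:
#   x = xmin * (1 - u) ** (-1 / (alpha - 1)), u ~ Uniform(0, 1)
realizations = [xmin * (1 - rng.random(1000)) ** (-1.0 / (alpha_true - 1.0))
                for _ in range(100)]
pooled = np.concatenate(realizations)  # 100_000 observations, no tail loss

alpha_hat = powerlaw_mle_alpha(pooled, xmin)
```

With 100,000 pooled observations the estimate should land very close to the true exponent, whereas rounding an averaged distribution discards exactly the tail points that pin down alpha.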

James
  • 1,764
  • 5
  • 31
  • 49
  • I think this is not an answer to my question. What do you mean by 'lumping' them? I take their average. Also, I am not asking for a particular method that gives a best fit; I already know one (maximum likelihood estimation). – Peaceful May 06 '15 at 17:12
  • Lumping = putting all your data points together; English slang, sorry. I don't see why you can't do MLE on all your data points at the same time, so don't average. This would solve your missing-point problem, which could bias your fit considerably. This might be a better question for Cross Validated. – James May 06 '15 at 22:04
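Following the comment above, the p-value itself can also be computed on the pooled sample rather than the averaged one. Below is a minimal sketch in the spirit of the Clauset–Shalizi–Newman Kolmogorov–Smirnov bootstrap: compute the KS distance of the data to the fitted power law, generate synthetic power-law samples, and report the fraction whose KS distance exceeds the observed one. It assumes continuous observations and a known xmin; the full recipe also re-estimates xmin on each synthetic sample, which is omitted here.

```python
import numpy as np

def ks_distance(x, xmin, alpha):
    """KS distance between the empirical CDF of x (x >= xmin) and the
    fitted continuous power-law CDF F(x) = 1 - (x/xmin)**(1 - alpha)."""
    x = np.sort(np.asarray(x, float)[np.asarray(x, float) >= xmin])
    n = x.size
    model = 1.0 - (x / xmin) ** (1.0 - alpha)
    emp_hi = np.arange(1, n + 1) / n   # empirical CDF just after each point
    emp_lo = np.arange(0, n) / n       # empirical CDF just before each point
    return max(np.max(np.abs(emp_hi - model)), np.max(np.abs(emp_lo - model)))

def powerlaw_pvalue(x, xmin, alpha, n_boot=200, rng=None):
    """Bootstrap p-value: fraction of synthetic power-law samples whose
    KS distance (with a re-fitted exponent) exceeds the observed one."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, float)
    d_obs = ks_distance(x, xmin, alpha)
    n = int(np.sum(x >= xmin))
    exceed = 0
    for _ in range(n_boot):
        # Draw a synthetic sample from the fitted power law (inverse CDF).
        synth = xmin * (1 - rng.random(n)) ** (-1.0 / (alpha - 1.0))
        a_hat = 1.0 + n / np.sum(np.log(synth / xmin))  # re-fit the exponent
        if ks_distance(synth, xmin, a_hat) >= d_obs:
            exceed += 1
    return exceed / n_boot

# Illustrative use on synthetic data (alpha = 2.5, xmin = 1):
rng = np.random.default_rng(1)
data = 1.0 * (1 - rng.random(500)) ** (-1.0 / 1.5)
alpha_hat = 1.0 + data.size / np.sum(np.log(data / 1.0))
p = powerlaw_pvalue(data, 1.0, alpha_hat, n_boot=50,
                    rng=np.random.default_rng(2))
```

Because the pooled sample keeps every raw observation, the tail values that disappeared under averaging and rounding contribute to both the fit and the KS distance.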