10

I have the data below and I need to identify the distribution of the data. Please help.

x <- c(37.50, 46.79, 48.30, 46.04, 43.40, 39.25, 38.49, 49.51, 40.38, 36.98, 40.00, 38.49, 37.74, 47.92, 44.53, 44.91, 44.91, 40.00, 41.51, 47.92, 36.98, 43.40)
Roland
Vanathaiyan S
  • 3
    Please define (with some rigor regarding statistical language) what you mean exactly by "identify the distribution of the data". – Roland Jul 31 '15 at 08:46
  • 1
    What do you mean by "identify the distribution"? You can use `hist(x)` to see its shape. In terms of "rigorous proof" (actually never rigorous), do a hypothesis test. – Ping Jin Jul 31 '15 at 08:46
  • 1
    This seems more a statistics question than a programming question. OP, please clarify what you are trying to do. – nicola Jul 31 '15 at 08:47
  • 2
    I think OP is looking for a tool that will identify which known distribution describes the data best. – Roman Luštrik Jul 31 '15 at 09:20
  • Is there a function/code/package which can automatically identify the distribution of the given data? – Vanathaiyan S Jul 31 '15 at 11:26
  • It is more a statistics question than a programming question; however, to me the question is very valid, as it asks for functions or a programming approach to find the distribution of data. More insight into the question is needed, though. – Barnaby Aug 08 '15 at 00:04

2 Answers

29

A neat approach is to use the fitdistrplus package, which provides tools for distribution fitting. Using your data as an example:

library(fitdistrplus)
descdist(x, discrete = FALSE)

[Figure: Cullen and Frey graph for x]

Now you can attempt to fit different distributions. For example:

normal_dist <- fitdist(x, "norm")

and subsequently inspect the fit:

plot(normal_dist)

[Figure: Fitting, the diagnostic plots from plot(normal_dist)]
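Other candidate distributions can be fitted in the same way and compared with the normal fit. The following is only a sketch: the choice of lognormal and gamma as extra candidates is my assumption, not something the data dictate, and the object names (`fit_norm`, `fit_lnorm`, `fit_gamma`, `gof`) are illustrative. It uses `gofstat()` and `denscomp()` from fitdistrplus.

```r
library(fitdistrplus)

x <- c(37.50, 46.79, 48.30, 46.04, 43.40, 39.25, 38.49, 49.51, 40.38, 36.98, 40.00,
       38.49, 37.74, 47.92, 44.53, 44.91, 44.91, 40.00, 41.51, 47.92, 36.98, 43.40)

# Fit each candidate by maximum likelihood (the default method of fitdist)
fit_norm  <- fitdist(x, "norm")
fit_lnorm <- fitdist(x, "lnorm")   # assumed candidate, for illustration only
fit_gamma <- fitdist(x, "gamma")   # assumed candidate, for illustration only

# Collect goodness-of-fit statistics and information criteria for all fits
gof <- gofstat(list(fit_norm, fit_lnorm, fit_gamma),
               fitnames = c("norm", "lnorm", "gamma"))
gof$aic   # a lower AIC suggests a better balance of fit and complexity

# Overlay the fitted densities on the histogram of the data
denscomp(list(fit_norm, fit_lnorm, fit_gamma),
         legendtext = c("norm", "lnorm", "gamma"))
```

With only 22 observations, small differences in AIC should not be over-interpreted; the plots are at least as informative as the numbers.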


As a general point, I would suggest that you have a look at this discussion on Cross Validated, where the subject is discussed at length. You may also want to have a look at the paper by Delignette-Muller and Dutang, "fitdistrplus: An R Package for Fitting Distributions", available here, if you are interested in a more detailed explanation of how to use the Cullen and Frey graph.
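To make the Cullen and Frey graph less of a black box, here is a rough base-R sketch of the two quantities it plots for the data: the square of the skewness (horizontal axis) and the kurtosis (vertical axis). It uses simple moment estimators; `descdist()` applies a bias correction by default, so its reported values may differ slightly. The variable names are illustrative.

```r
x <- c(37.50, 46.79, 48.30, 46.04, 43.40, 39.25, 38.49, 49.51, 40.38, 36.98, 40.00,
       38.49, 37.74, 47.92, 44.53, 44.91, 44.91, 40.00, 41.51, 47.92, 36.98, 43.40)

m  <- mean(x)
m2 <- mean((x - m)^2)   # second central moment
m3 <- mean((x - m)^3)   # third central moment
m4 <- mean((x - m)^4)   # fourth central moment

skewness <- m3 / m2^1.5
kurtosis <- m4 / m2^2   # non-excess kurtosis: a normal distribution has 3

# Coordinates of the observation on the Cullen and Frey graph
c(square_of_skewness = skewness^2, kurtosis = kurtosis)
```

A normal distribution sits at the single point (0, 3) on that graph, which is why it appears as one symbol rather than a line or an area.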

Konrad
  • 3
    The link to the discussion on CV is a very important part of this answer. (+1) – Roland Jul 31 '15 at 12:29
  • 1
    How do I interpret this Cullen and Frey graph? – Vanathaiyan S Aug 05 '15 at 05:07
  • 2
    @VanathaiyanS the **CF** graph compares the skewness and kurtosis of the given data with those of the specified distributions. I would suggest that you have a look at the linked discussion on CV, the help file and the linked paper. To summarise (and oversimplify) in a few words: for some distributions, like the normal, there is only one possible value for the skewness and the kurtosis, so the distribution is a single point on the graph. For other distributions, the areas of possible values are represented. This is a very simplified answer and you should also consider other methods, but the **CF** graph is a good start. – Konrad Aug 05 '15 at 08:55
5

The first thing you can do is plot the histogram and overlay the density:

hist(x, freq = FALSE)
lines(density(x))

Then you can see that the distribution is bimodal, so it could be a mixture of two distributions, or something else.
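As a rough numerical companion to the visual check, the local maxima of the kernel density estimate can be located directly. This is only a sketch: the number of peaks depends on the bandwidth (here the `density()` default), and with 22 observations it should be read with caution. The names `d` and `peaks` are illustrative.

```r
x <- c(37.50, 46.79, 48.30, 46.04, 43.40, 39.25, 38.49, 49.51, 40.38, 36.98, 40.00,
       38.49, 37.74, 47.92, 44.53, 44.91, 44.91, 40.00, 41.51, 47.92, 36.98, 43.40)

d <- density(x)   # kernel density estimate with the default bandwidth

# A local maximum is where the slope changes from positive to negative
peaks <- which(diff(sign(diff(d$y))) == -2) + 1

length(peaks)   # number of modes found at this bandwidth
d$x[peaks]      # approximate locations of the modes
```

Re-running with, for example, `density(x, adjust = 2)` shows how sensitive the mode count is to smoothing.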

Once you have identified a candidate distribution, a Q-Q plot (`qqplot`) can help you visually compare the quantiles.
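For illustration, here is a minimal base-R sketch of such a comparison. The normal distribution is used purely as an assumed candidate, with its parameters estimated from the data; any other candidate would substitute its own quantile function for `qnorm`. A Shapiro-Wilk test is added as one example of a formal check for normality.

```r
x <- c(37.50, 46.79, 48.30, 46.04, 43.40, 39.25, 38.49, 49.51, 40.38, 36.98, 40.00,
       38.49, 37.74, 47.92, 44.53, 44.91, 44.91, 40.00, 41.51, 47.92, 36.98, 43.40)

# Theoretical quantiles of the assumed candidate (normal, parameters from the data)
theo <- qnorm(ppoints(length(x)), mean = mean(x), sd = sd(x))

# Sample quantiles against theoretical quantiles
qqplot(theo, x,
       xlab = "Theoretical quantiles (normal)",
       ylab = "Sample quantiles")
abline(0, 1)   # points close to this line indicate agreement with the candidate

# Formal test of normality; a small p-value is evidence against normality
sw <- shapiro.test(x)
sw$p.value
```

A large p-value does not prove the data are normal, particularly with a sample this small; it only means the test found no strong evidence against it.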

thothal