I have data from two samples and I want to draw a frequency distribution plot in R. I have a reference version made in Excel:

[Image: the plot I want to get in R, produced in Excel]

I loaded the data into R as HistSerp. It has 136 observations of 2 variables.

summary(HistSerp)
       V1              V2
 Min.   :0.000   Min.   :0.0000
 1st Qu.:0.000   1st Qu.:0.3752
 Median :0.000   Median :1.2845
 Mean   :0.055   Mean   :1.2144
 3rd Qu.:0.082   3rd Qu.:1.9952
 Max.   :1.082   Max.   :2.9800

class(HistSerp$V1)
"numeric"
class(HistSerp$V2)
"numeric"

If I run HistSerp.m <- melt(HistSerp) and then ggplot(HistSerp.m) + geom_freqpoly(aes(x = value, y = ..density.., colour = variable)), the plot looks like this:

[Image: the resulting plot]

I don't know why the y-axis spans those values, and I'm not sure whether it's only a y-axis labelling problem, because the plot itself looks different. I've also tried geom_density(), hist(HistSerp$V1, freq=FALSE), etc., but I can't get what I expect; I get the same result as before. I guess there's something wrong with my data, but I can't figure out what it is. Any help will be appreciated.

Thanks

P.S. Should I copy the data (136x2)?

Update: here is the data. Sorry if there's a better way to share it...

0.144   2.024
0.082   2.548
0.082   1.943
0.000   2.599
0.000   2.233
0.000   2.342
0.082   1.655
0.082   2.200
0.000   2.261
0.000   2.408
0.000   2.127
0.000   2.053
0.000   1.929
0.000   1.413
0.000   2.400
0.000   2.777
0.000   2.685
0.000   1.436
0.000   1.573
0.000   2.504
0.000   1.533
0.000   1.434
0.000   1.421
0.000   2.534
0.082   1.728
0.000   1.984
0.082   1.287
0.000   2.324
0.164   2.405
0.279   1.989
0.082   2.729
0.144   2.046
0.226   2.496
0.000   2.980
0.000   2.634
0.000   1.792
0.000   1.571
0.000   0.612
0.000   0.884
0.000   0.449
0.000   2.318
0.082   0.449
0.000   0.449
0.000   0.563
0.082   0.919
0.000   0.617
0.082   1.297
0.144   0.719
0.000   1.897
0.000   1.338
0.000   0.337
0.000   1.555
0.000   0.273
0.291   0.656
0.000   0.273
0.082   0.388
0.082   1.911
0.082   0.852
0.000   1.580
0.000   1.450
0.000   1.209
0.000   2.049
0.082   2.694
0.082   1.089
0.246   2.643
0.000   2.393
0.000   1.702
0.000   2.595
0.000   1.432
0.000   2.094
0.000   1.526
0.082   1.775
0.000   0.273
0.000   1.405
0.000   2.014
0.000   0.543
0.000   0.586
0.000   1.224
0.000   0.719
0.164   0.201
0.000   0.388
0.082   0.232
0.000   0.116
0.000   0.116
0.082   1.395
0.000   0.116
0.000   0.232
0.082   0.844
0.000   1.153
0.082   0.000
0.667   0.000
0.000   1.535
0.000   2.687
0.000   0.922
0.226   0.337
0.197   0.999
1.082   1.373
0.082   0.396
0.082   0.116
0.000   1.667
0.000   0.731
0.000   0.544
0.082   2.072
0.000   2.262
0.164   2.111
0.082   1.675
0.000   0.116
0.000   0.232
0.082   0.116
0.000   1.004
0.000   0.116
0.164   0.116
0.082   0.699
0.000   0.000
0.000   0.273
0.082   0.000
0.000   0.388
0.082   0.000
0.000   0.116
0.000   0.273
0.000   0.000
0.000   0.649
0.164   0.000
0.082   0.000
0.082   0.000
0.000   0.000
0.082   0.000
0.144   1.282
0.000   1.772
0.000   0.116
0.082   0.000
0.000   1.416
0.000   0.563
0.082   0.510
0.000   0.316
0.164   1.124
PGreen
  • Yep, adding the data would help. Might be that the problem is somewhere in there - and in any case it would allow tracing your steps and identifying where something goes wrong. – dlaehnemann May 24 '13 at 13:23
  • Please use the output of `dput(HistSerp)` to share the data in a useful way. – Roland May 24 '13 at 13:25
  • What is the bin width you used in excel? – Roland May 24 '13 at 13:28
  • @Roland, I saw your comment late. Is it fine like this? In Excel I use breaks every 0.25. – PGreen May 24 '13 at 13:29
  • Hard to say without more understanding of what exactly Excel thinks a "frequency polygon" is. A density is not bounded by one; it _integrates_ to one, so the curve itself can (and frequently does) stretch well above one. Whatever Excel is plotting, it isn't a density, since it clearly does not integrate to one. – joran May 24 '13 at 13:48
  • Thanks @joran. Maybe I mixed up the two terms; what I need is frequency, which ranges from 0 to 1. (Just to explain: in Excel I made groups from 0 to 3 every 0.25 (13 groups, B1:B13), computed FREQUENCY(A1:A136;B1:B13), and then divided each element of the resulting table (which holds counts) by 136 to get the frequency.) – PGreen May 24 '13 at 13:56
  • In the first column of data, 132 of the 136 values are between 0 and 0.25. Based on that I can't understand how you could end up plotting a frequency value of around 0.6. – joran May 24 '13 at 14:06
  • For the first column of data, Excel gives the following table: `0-->85`; `0.25-->47` (85+47=132, the figure you refer to), `0.5 --> 2`, `0.75-->...` and so on. When I divide this by 136, I get `0-->0.625`; `0.25-->0.345`, `0.5-->0.01`, etc. In the Excel plot you see y ~ 0.6 at x=0 and y ~ 0.35 at x=0.25. Is this what you mean? – PGreen May 24 '13 at 14:13
  • Ok, so Excel is being bad. If you specify breaks of 0, 0.25, etc., you should get bins of equal size. But Excel is putting all the values equal to 0 into their own bin. So really, you have a single bin that is infinitely small, and then 13 others of width 0.25. Suffice it to say that this is, well, non-standard. – joran May 24 '13 at 14:17
  • Thanks @joran, that's interesting, I wasn't aware of that. Considering this, do you have any suggestion about how to plot this data in R so that I get frequencies from 0 to 1 on the y-axis? – PGreen May 24 '13 at 14:32
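
The Excel binning discussed in the comments above can be sketched in base R. This is only an illustration: the vector `x` is a small toy sample rather than the full column, and the names `limits`, `bins`, `counts` and `rel_freq` are made up for this example. It assumes FREQUENCY() counts the values less than or equal to the first limit, then each right-closed interval, then everything above the last limit.

```r
# Toy vector for illustration only (not the full 136-row column).
x <- c(0, 0, 0, 0.082, 0.144, 0.25, 0.279, 1.082)

# Excel-style bin limits: 0, 0.25, ..., 3 (13 values, as in B1:B13).
limits <- seq(0, 3, by = 0.25)

# Right-closed intervals with open-ended outer bins: values <= 0 land in
# their own bin, which is why the zeros are separated from (0, 0.25].
bins <- cut(x, breaks = c(-Inf, limits, Inf), right = TRUE)

counts <- as.vector(table(bins))  # 14 counts: 13 limits plus an overflow bin
rel_freq <- counts / length(x)    # relative frequencies, each between 0 and 1

print(data.frame(bin = levels(bins), count = counts, rel_freq = rel_freq))
```

With these limits the first bin holds only the exact zeros, which matches the `0-->85` row reported for the first data column.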

1 Answer

You have a couple of options:

geom_freqpoly(aes(y = ..count.. / sum(..count..)))

which is probably what you want. Then there's:

geom_freqpoly(aes(y = ..ndensity..))

which is the density estimate, but rescaled so that its maximum is 1 (i.e. it will always range from 0 to 1). And finally, the associated:

geom_freqpoly(aes(y = ..ncount..))

which is similar, but for the counts. You can read about the options at ?stat_bin.
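
As a fuller sketch of the first option, here is one way the whole pipeline might look. It makes several assumptions that are not in the original post: it uses only the first eight rows of the posted data as a stand-in for HistSerp, base `stack()` instead of `melt()` (so the columns are named `values` and `ind`), and `after_stat()`, which needs ggplot2 3.3.0 or later and replaces the older `..count..` notation. It also uses `density * width` rather than `count / sum(count)`; my understanding is that `sum(count)` is taken over the whole layer, so with two coloured groups the proportions would be computed against both samples pooled together, but treat that as something to check against your ggplot2 version.

```r
library(ggplot2)

# First eight rows of the posted data, a small stand-in for HistSerp.
HistSerp <- data.frame(
  V1 = c(0.144, 0.082, 0.082, 0.000, 0.000, 0.000, 0.082, 0.082),
  V2 = c(2.024, 2.548, 1.943, 2.599, 2.233, 2.342, 1.655, 2.200)
)

# Long format: base stack() returns columns `values` and `ind`.
long <- stack(HistSerp)

# density * width equals count / (group total), so each coloured line
# shows proportions within its own sample and stays between 0 and 1.
p <- ggplot(long, aes(x = values, colour = ind)) +
  geom_freqpoly(
    aes(y = after_stat(density * width)),
    binwidth = 0.25, boundary = 0
  ) +
  labs(x = "value", y = "relative frequency", colour = "variable")

print(p)
```

The `binwidth = 0.25, boundary = 0` choice mirrors the 0.25 breaks mentioned in the comments, but unlike Excel it keeps the zeros in the first equal-width bin.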

joran
  • Great. Using your first option did what I was expecting :) Thanks! That's been a mini-master-class ;) – PGreen May 24 '13 at 15:12