1

I loaded a data set called gob into R and ran the handy summary function. Note that the 3rd quartile is less than the mean. How can this be? Is it due to the size of my data or something else?

I already tried passing in a large value for the digits parameter (e.g. 10), and that does not resolve the issue.

> summary(gob, digits=10)

   customer_id         100101.D            100199.D            100201.D        
 Min.   :   1083   Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
 1st Qu.: 965928   1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000  
 Median :2448738   Median :0.0000000   Median :0.0000000   Median :0.0000000  
 Mean   :2660101   Mean   :0.0010027   Mean   :0.0013348   Mean   :0.0000878  
 3rd Qu.:4133368   3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000  
 Max.   :6538193   Max.   :1.0000000   Max.   :1.0000000   Max.   :0.7520278  

Note that for gob$`100201.D` the mean is 0.0000878 but the 3rd Qu. is 0.

mrk
Ed Fine
  • Note that [quartiles](http://en.wikipedia.org/wiki/Quartile) just divide your sample by number, not by value. – Xymostech Dec 06 '12 at 07:20
  • It's better to use a more descriptive title for your question, which really is "*Why is my 3rd quartile sometimes less than my mean when using summary() in R?*" (at which point, this becomes more of a question for [Cross Validated](http://stats.stackexchange.com/)). SO isn't really a place to post *possible* bug reports. Post your problem, and if it really is a bug, hopefully it gets noticed and fixed. See http://stackoverflow.com/a/10588698/1270695 for an example. The question has no mention of bugs, but the package maintainer identified it as such and filed a bug report where it belongs. – A5C1D2H2I1M1N2O1R2T1 Dec 06 '12 at 09:04
  • This is not an `R` question, as the answers show. – Carl Witthoft Dec 06 '12 at 12:28

2 Answers

14

It is not a bug; your data simply contains a lot of 0 values. For example, if I build x from twelve 0s and one 1, the 3rd quartile comes out smaller than the mean:

x <- c(0,0,0,0,0,0,0,0,0,0,0,0,1)
summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.00000 0.00000 0.07692 0.00000 1.00000

Use table() on your column to see the distribution of values:

table(x)
x
 0  1
12  1
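
The same check can be done numerically. If more than 75% of the values in a column are 0 and none are negative, the 3rd quartile has to be 0, while any single positive value pulls the mean above 0. Here is a minimal sketch of that check; `col` is a made-up stand-in for one column, not the real gob data:

```r
# Hypothetical stand-in for one column of the data (not the real gob data).
col <- c(rep(0, 97), 0.2, 0.5, 1)

# Share of values that are exactly zero.
zero_share <- mean(col == 0)

# Third quartile and mean; unname() drops the "75%" label for easy comparison.
q3  <- unname(quantile(col, 0.75))
avg <- mean(col)

print(zero_share)   # fraction of zeros in the column
print(q3 < avg)     # is the 3rd quartile below the mean?
```

If `zero_share` is above 0.75, a 3rd quartile of 0 alongside a small positive mean is exactly what you should expect.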
Didzis Elferts
  • You need to accept Didzis' answer now by clicking the tick mark. By the way, this is similar to the logic that says most people have an above average number of legs... – Spacedman Dec 06 '12 at 08:27
5

The 3rd quartile can be lower than the mean. It is not 75% of the highest value; it is the value found 75% of the way through the vector once it is sorted from lowest to highest. In other words:

Vector <- c(0,0,0,0,0,0,0,1)
mean(Vector)
[1] 0.125
quantile(Vector, 0.75)
75%
  0

To find the 3rd quartile, R sorts the data from lowest to highest, then takes the value roughly 75% of the way through that sorted vector. So basically:

ThirdQuar <- sort(Vector)[round(length(Vector) * 0.75)]

(Note that if the position lands between two elements, R's default method interpolates linearly between the two neighbouring values rather than rounding. But this is the basic idea.)
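
For a closer look at what happens when the position is fractional, here is a sketch of the rule that `quantile()` uses by default (`type = 7`), written out by hand. The helper name `manual_q7` is made up for illustration and is not part of base R:

```r
# Sketch of R's default quantile rule (type = 7), written out by hand.
# manual_q7 is a hypothetical helper name, not a base R function.
manual_q7 <- function(v, p) {
  s  <- sort(v)                   # order from lowest to highest
  h  <- (length(s) - 1) * p + 1   # fractional position in the sorted vector
  lo <- floor(h)
  hi <- ceiling(h)
  # Linear interpolation between the two neighbouring sorted values.
  s[lo] + (h - lo) * (s[hi] - s[lo])
}

Vector <- c(0, 0, 0, 0, 0, 0, 0, 1)
manual_q7(Vector, 0.75)                    # position 6.25 in the sorted vector
unname(quantile(Vector, 0.75, type = 7))   # base R, for comparison
```

With eight values the 75% position is 6.25, which falls between the 6th and 7th sorted values. Both are 0 here, so the result is 0 even though the mean is 0.125.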

Señor O