
Could anyone please tell me the difference between the quantile function in R and the cut2 function from the Hmisc package?

I understand that quantile has 9 different methods (the type argument) for computing quantiles. However, when I use cut2(mydata, g = 4), the quartiles in the output do not correspond to the output of quantile under any of those methods.

Any help greatly appreciated.

Thanks in advance.

Maeve90

1 Answer


From the cut2 helpfile:

Function like cut but left endpoints are inclusive and labels are of the form [lower, upper), except that last interval is [lower,upper]. If cuts are given, will by default make sure that cuts include entire range of x.
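The labelling convention described in that quote can be reproduced with base R alone, which makes it easier to see what it means. This is only a sketch of the convention, not of how cut2 is implemented; the vector x, the break points, and the name left_closed are chosen here purely for illustration.

```r
# Illustrative data and break points (assumptions, not from the original post)
x <- 0:100
breaks <- c(0, 25, 50, 75, 100)

# right = FALSE makes intervals closed on the left: [lower, upper)
# include.lowest = TRUE then closes the last interval on the right: [lower, upper]
left_closed <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

levels(left_closed)
table(left_closed)
```

With these arguments the value 25 falls into the second interval rather than the first, which is the opposite of cut's default behaviour.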

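So, cut2 is basically cut with a few different defaults (one caveat: its g argument asks for quantile groups, so a call with g = 4 is not an equal-width split). Let's look at cut then.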

From the cut helpfile:

cut divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.

From the quantile helpfile:

The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

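One cuts the range of x into intervals of equal width; the other cuts the "frequency" of x, that is, it finds the points that split the observations into groups of equal size.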

An illustration:

out <- 0:100
out2 <- c(seq(0, 50, 0.001), 51:100)

Both have the same range, from 0 to 100.

levels(cut(out, 4, include.lowest = TRUE))
[1] "[-0.1,25]" "(25,50]"   "(50,75]"   "(75,100]"
levels(cut(out2, 4, include.lowest = TRUE))
[1] "[-0.1,25]" "(25,50]"   "(50,75]"   "(75,100]"

But there are many more "datapoints" living in out2, in particular for values between 0 and 50. Therefore, they do not have the same frequencies along the range:

quantile(out)
  0%  25%  50%  75% 100% 
   0   25   50   75  100 
quantile(out2)
      0%      25%      50%      75%     100% 
  0.0000  12.5125  25.0250  37.5375 100.0000 

This is the difference between cut and quantile.

The above example also shows you when both agree, namely when the data are spread evenly over their range. The sequence from 0 to 100, for instance, is evenly distributed on the range from 0 to 100. Here, the break points from cut and the quartiles from quantile are basically identical.
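A quick way to check this is to compare the two sets of break points directly. The sketch below is a rough check rather than a reproduction of cut's internals: width_breaks and quant_breaks are names made up for illustration, and the equal-width points are computed straight from the range, ignoring the small outward adjustment that cut applies to the outermost breaks (the reason for the -0.1 in the labels above).

```r
out <- 0:100
out2 <- c(seq(0, 50, 0.001), 51:100)

# Equal-width break points over the range (4 intervals need 5 points)
width_breaks <- seq(min(out), max(out), length.out = 5)

# Quartiles using quantile's default method
quant_breaks <- unname(quantile(out))
quant_breaks2 <- unname(quantile(out2))

# Evenly spaced data: the two sets of break points coincide
isTRUE(all.equal(width_breaks, quant_breaks))

# Unevenly spread data with the same range: they do not
isTRUE(all.equal(width_breaks, quant_breaks2))
```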

To illustrate even further, consider:

outdf <- data.frame(out = out, cut = cut(out, 4, include.lowest = TRUE))
out2df <- data.frame(out = out2, cut = cut(out2, 4, include.lowest = TRUE))

table(outdf$cut)
[-0.1,25]   (25,50]   (50,75]  (75,100] 
       26        25        25        25 
table(out2df$cut)
[-0.1,25]   (25,50]   (50,75]  (75,100] 
    25001     25000        25        25 

Here, you clearly see the different frequencies in each bin.
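If bins with (roughly) equal counts are what you are after, one possible approach in base R is to hand the quantiles to cut as explicit break points. This is a minimal sketch, assuming quantile's default method; q_breaks and q_bins are names introduced here for illustration, and the result is not claimed to match what cut2 produces.

```r
out2 <- c(seq(0, 50, 0.001), 51:100)

# Quartiles of the data, used as break points instead of equal-width ones
q_breaks <- quantile(out2)

# include.lowest = TRUE keeps the minimum, which would otherwise become NA
q_bins <- cut(out2, breaks = q_breaks, include.lowest = TRUE)

# Counts per bin are now nearly equal, while the bin widths are not
table(q_bins)
```

Note that this only works when the quantiles are distinct; with heavily tied data, cut will stop with an error because the breaks are not unique.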

coffeinjunky