
I need to categorize a continuous variable into 4 classes, each with the same number of observations. I have used:

cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)), include.lowest = TRUE, right = FALSE)

My problem is that the number of observations in each category is not exactly the same, because several observations have exactly the same value as one of the quantiles. How can I get groups of exactly equal size?

My variable is `waiting`:

[1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78 69 74
[26] 83 55 76 78 79 73 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83 64 53 82 59
[51] 75 90 54 80 54 83 71 64 77 81 59 84 48 82 60 92 78 78 65 73 82 56 79 71 62
[76] 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90 50 78 63 72 84 75 51 82
[101] 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59 81 50 85 59 87 53 69 77 56 88
[126] 81 45 82 55 90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80 49 96 53
[151] 77 77 65 81 71 70 81 93 53 89 45 86 58 78 66 76 63 88 52 93 49 57 77 68 81
[176] 81 73 50 85 74 55 77 83 83 51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78
[201] 60 82 91 53 78 46 77 84 49 83 71 80 49 75 64 76 53 94 55 76 50 82 54 75 78
[226] 79 78 78 70 79 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82 67 74
[251] 54 83 73 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74

which is in the `faithful` dataset in R. It has 272 observations, which is divisible by 4, so each category should contain 68 observations.

I have used

newwait<-cut(waiting, breaks =quantile(waiting,probs=seq(0,1,0.25)),include.lowest=TRUE,right=FALSE)

table(newwait)
newwait
[43,58) [58,76) [76,82) [82,96] 
     66      68      67      71 

As you can see, the number of observations in each group is similar but not exactly the same.

ekad
user2974841
  • I tried your code with 100/1000/10000/100000 random numbers and I always get 4 groups of the same size. Can you post your data (a part of it, maybe)? – Michele Nov 09 '13 at 21:48
  • reproducible example: `x <- rep(1:5,c(1,3,3,2,1))`. `table(cut(...))` gives (1,3,3,3) [although this particular example is impossible since `length(x)` isn't divisible by 4] – Ben Bolker Nov 09 '13 at 21:55
  • I have edited my question with the variable – user2974841 Nov 10 '13 at 10:41

1 Answer


Basically, it sounds like you need to deal with ties. You also need a vector whose length is divisible by 4, but I'll assume you know that.

Here's a solution using the tie-breaking option (`ties.method`) of `rank`:

set.seed(1)
x <- round(runif(1000,0,1),1)
table(x)
## x
##   0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 
##  43 106  95 103 112 109  82 102  95 100  53

y <- rank(x, ties.method='first') # <- this forces tie breaks
cuts <- cut(y, breaks = quantile(y,probs=seq(0,1,0.25)),
               include.lowest=TRUE,
               right=FALSE)
# check that cuts are all the same length:
lapply(split(x,cuts), length)
$`[1,251)`
[1] 250

$`[251,500)`
[1] 250

$`[500,750)`
[1] 250

$`[750,1e+03]`
[1] 250
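
The same idea can be applied to the `waiting` variable from the question. This is only a minimal sketch, assuming the data are taken from `faithful$waiting`; the names `r` and `newwait` and the labels `Q1` to `Q4` are illustrative choices, not part of the original code.

```r
# faithful ships with base R (datasets package); assumed to match the
# vector posted in the question
waiting <- faithful$waiting

# Rank the values, breaking ties by order of appearance,
# so that every observation gets a unique rank from 1 to 272
r <- rank(waiting, ties.method = "first")

# Cut the ranks (not the raw values) at their quartiles
newwait <- cut(r,
               breaks = quantile(r, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE,
               right = FALSE,
               labels = paste0("Q", 1:4))  # illustrative labels

# Each of the four groups should now contain 68 observations
table(newwait)
```

Because ties are broken by order of appearance, observations with the same waiting time can end up in different groups. That is the price of exactly equal group sizes.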
Thomas