
My question is pretty simple: the cut() function lets me choose the breaks along which the range of my vector is divided into intervals. I would like to control the number of observations within each newly created interval, similar to what I get by passing quantile() output as the breaks in the cut() call. However, I don't want to use quantiles, because I would like the intervals to be fixed so that I can match them between different databases for further comparison, and I want the same discrete values to appear in the labels of the newly cut vectors.

I used to use this for the quantile approach:

df$z<-cut(df$x, quantile(df$x, (0:10)/10), include.lowest=TRUE)

Which is fairly simple. My new approach is even simpler, so it resembles this for example:

df$z<-cut(df$x, c(0.04,0.055,0.06,0.065,0.07,0.075,0.08,0.085,0.09,0.095,0.11), include.lowest=T)

I then have another variable which I want to calculate some statistics on, according to the levels of the discrete variable.

So it would go something like this :

df$conf.intx<-ifelse(df$z=="1",t.test(df[df$z=="1",]$y)$conf.int[1],
              ifelse(df$z=="2",t.test(df[df$z=="2",]$y)$conf.int[1],
              ifelse(df$z=="3",t.test(df[df$z=="3",]$y)$conf.int[1],
              ifelse(df$z=="4",t.test(df[df$z=="4",]$y)$conf.int[1],NA))))
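As a side note, the nested ifelse() calls above can probably be collapsed into one grouped computation. The following is only a minimal sketch under stated assumptions: the data frame is simulated toy data (not the real database), the breaks are arbitrary, `ci_lower` is a hypothetical helper name, and `labels = FALSE` is used so that z holds the integer codes 1, 2, 3, ... that the comparisons above expect.

```r
# Toy data standing in for the real database (assumption, not the asker's data)
set.seed(1)
df <- data.frame(x = runif(200, 0.04, 0.11), y = rnorm(200))

# Arbitrary fixed breaks; labels = FALSE returns integer codes 1..4
breaks <- c(0.04, 0.06, 0.08, 0.095, 0.11)
df$z <- cut(df$x, breaks, include.lowest = TRUE, labels = FALSE)

# Hypothetical helper: lower bound of the 95% t-based confidence interval
ci_lower <- function(v) {
  if (length(v) < 2) return(NA_real_)  # t.test() needs at least two values
  t.test(v)$conf.int[1]
}

# ave() applies ci_lower within each level of z and returns a vector
# aligned with the rows of df, so no nested ifelse() is needed
df$conf.intx <- ave(df$y, df$z, FUN = ci_lower)

head(df)
```

Every row in the same interval receives the same lower bound, whatever the number of levels of z.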

But to compute this kind of t-test confidence interval on each 'pool' of y values (one pool per interval of the discrete variable, holding as many values as there are observations in that interval), I need to control the number of values within each interval created for z, so that my test remains valid, at least as far as the number of observations is concerned.

Simply put, I need an automated procedure that creates the vector of breaks for the z variable so that each interval contains a minimum number of observations. As an added complication, the breaks should be the same for two different databases, and I don't know whether that is possible.

Any help on the matter would be welcome, thank you in advance.

EDIT: here is a sample of my data for x.

    structure(list(x = c(5.319125, 7.3036667, 5.5166167, 7.0308333, 
5.6812917, 6.5496583, 5.6621833, 6.4682, 5.4897417, 7.185175, 
6.44905, 7.2055833, 7.629375, 6.2282833, 6.6813917, 7.7976, 6.683975, 
5.5089083, 7.307475, 7.3958667, 6.2036583, 6.2488833, 5.9372, 
6.6180167, 6.4167833, 5.640275, 8.7416917, 8.3134167, 6.8996833, 
5.1161917, 7.0606333, 5.2622667, 6.780925, 5.4615417, 6.48185, 
5.51585, 6.2224333, 5.3660667, 7.196525, 6.2984083, 7.0137833, 
7.4490083, 5.9712333, 6.4287833, 7.6693917, 6.4406417, 5.4135083, 
7.16245, 7.2267, 5.820325, 6.066175, 5.760975, 6.4775, 6.2625, 
5.5182583, 8.446625, 8.19025, 6.7955333, 4.7899583, 6.5680167, 
4.5965917, 6.3539333, 4.6639, 6.0489667, 4.9047833, 5.353625, 
4.711425, 6.6268833, 5.5458083, 6.3271917, 6.4591417, 5.1843917, 
5.6117167, 7.1828417, 5.6956917, 5.0271917, 6.741875, 6.68305, 
4.7859667, 5.3068667, 5.3245, 5.745675, 5.7518917, 5.37945, 8.0030417, 
7.7064583, 6.2935333, 5.1838667, 6.9369333, 4.9734583, 6.7257167, 
5.0510333, 6.4257667, 5.2858083, 5.7285167, 5.084, 7.0092833, 
5.905875, 6.6893417, 6.8319583, 5.5558083, 5.9854833, 7.5552167, 
6.064625, 5.3990333, 7.115175, 7.0600167, 5.1644833, 5.6848667, 
5.7014417, 6.1051, 6.1186333, 5.7217667, 8.3685417, 8.071325, 
6.6547333, 5.5972417, 7.4226, 5.539725, 7.26335, 5.645975, 6.87475, 
5.8486167, 6.3001667, 5.5997833, 7.4353167, 6.5089583, 7.213625, 
7.3125667, 6.12095, 6.5410083, 8.0639083, 6.6505167, 5.8886417, 
7.6301167, 7.5850417, 5.7693667, 6.2480167, 6.1847167, 6.6896167, 
6.6323917, 6.1972167, 8.8560333, 8.5501083, 7.1036167, 4.9929583, 
6.9839583, 5.3847417, 6.8814417, 5.59555, 6.7867167, 5.7831333, 
6.9370917, 5.7400917, 7.6922, 6.3151, 7.084725, 7.0414417, 5.95435, 
6.4274167, 7.6692167, 6.9159, 6.0856083, 7.3079583, 7.1937667, 
5.744675, 5.946525, 6.0651833, 6.8488833, 6.5924333, 5.772025, 
8.3281167, 8.5475917, 6.7952917, 8.248525, 5.1931083, 7.0688917, 
5.4793583, 7.0091583, 5.7593, 7.1053333, 5.9382583, 7.1765417, 
6.003075, 7.7699833, 6.2757333, 7.2446583, 7.179275, 6.0013083, 
6.447975, 7.7845833, 6.9071083, 6.1009, 7.425425, 7.4619083, 
5.9380667, 6.2116, 6.13315, 7.0852, 7.0047417, 6.0763917, 8.5926583, 
8.7468417, 7.2485167, 8.5096833, 5.1541, 7.0479917, 5.43065, 
6.9689083, 5.7356, 7.0842917, 5.9051667, 7.1283333, 5.9666667, 
7.7295583, 6.249925, 7.21005, 7.1427167, 5.9675583, 6.4135667, 
7.7448583, 6.874275, 6.0679333, 7.388675, 7.429025, 5.911225, 
6.1757167, 6.095225, 7.045775, 6.9870833, 6.0567333, 8.5771167, 
8.7541917, 7.3187333, 8.5092083, 5.5746, 7.342925, 5.8561667, 
7.4704667, 5.922225, 6.9787, 6.1564167, 7.6059667, 5.9122917, 
7.7848833, 6.6192, 7.34055, 7.2352417, 5.9776083, 6.5197583, 
7.4891583, 7.2185667, 6.4710167, 7.70945, 7.5078083, 6.1470417, 
6.66115, 6.6899333, 7.4454083, 7.2270917, 6.350075, 8.3156667, 
8.9007917, 6.7578083, 8.3258083, 5.1996, 6.9688833, 5.3592917, 
6.7583417, 5.5623583, 6.756375, 5.7361, 7.120425, 5.6567, 7.6174667, 
6.1474833, 7.1442167, 6.74475, 5.5820333, 6.0106, 7.142675, 6.667475, 
5.9067917, 7.2392, 7.058675, 5.6394417, 5.9119167, 5.8367333, 
6.798025, 6.694675, 5.8565917, 8.6035083, 8.912375, 7.0501083, 
8.38045, 4.8478083, 6.7493167, 5.3686667, 6.5152333, 5.282025, 
6.5464333, 5.5085583, 6.870975, 5.4757667, 7.318, 5.92225, 6.9300417, 
6.5758083, 5.4233083, 5.8295583, 7.0451, 6.4790083, 5.68255, 
6.9632833, 6.9965833, 5.5005667, 5.717725, 5.5938083, 6.5309, 
6.4824583, 5.4429833, 8.072575, 8.3635, 6.5797167, 8.0352333, 
4.6289833, 6.64105, 4.8883833, 6.2025833, 5.2291833, 6.4814667, 
5.2211083, 6.5780083, 5.196275, 7.030725, 5.6001583, 6.620475, 
6.2858333, 5.114375, 5.5424417, 6.7784917, 6.1561333, 5.339375, 
6.6249083, 6.6248583, 5.139775, 5.4195, 5.4531833, 6.3348583, 
6.4041417, 5.292, 7.6243833, 7.9624583, 6.3226417, 7.761175, 
4.8419083, 6.8384083, 5.3500417, 6.5903333, 5.33275, 6.732575, 
5.4486, 6.8069417, 5.4569583, 7.26275, 5.835525, 6.8680333, 6.6712333, 
5.4720417, 5.904325, 7.1506917, 6.4746833, 5.638675, 6.9570667, 
7.0017333, 5.5033667, 5.6859333, 5.651875, 6.5903, 6.529725, 
5.4819667, 7.971975, 8.2337833, 6.5815333, 7.9736583, 5.7711917, 
7.543325, 5.8986917, 7.5081333, 6.2920333, 7.5321667, 6.4908917, 
7.7616583, 6.4509417, 8.08035, 6.8219, 7.7939167, 7.6491333, 
6.4773583, 6.9338667, 8.1865583, 7.3998917, 6.572125, 7.9198417, 
8.0568, 6.5880333, 6.8299667, 6.7399833, 7.6436, 7.509275, 6.5139833, 
9.1520167, 9.3580667, 7.65415, 9.0725167, 5.7483583, 7.5230417, 
5.89105, 7.4808833, 6.1969667, 7.4923583, 6.4092583, 7.70695, 
6.3970833, 8.0971333, 6.7949083, 7.76445, 7.6170167, 6.4494333, 
6.8997, 8.1575333, 7.3728417, 6.544075, 7.888, 8.0215, 6.5484, 
6.7911667, 6.7121917, 7.6179083, 7.4731167, 6.4629167, 9.1226333, 
9.3307083, 7.6230583, 9.024875, 5.543925, 7.1460833, 5.6575583, 
7.5986083, 6.027075, 7.4386167, 6.3500333, 7.6694833, 6.3682583, 
8.0843333, 6.7181083, 7.7376, 7.5818583, 6.4010667, 6.8440083, 
8.1217917, 7.3290833, 6.5187333, 7.8591667, 7.9898583, 6.5051, 
6.7251167, 6.6881333, 7.477675, 7.3571333, 6.3351833, 8.881575, 
9.12315, 7.3851, 8.8008667, 5.3437833, 7.1560417, 5.5748, 7.4622583, 
5.9412417, 7.3428667, 6.2594167, 7.5839167, 6.28685, 8.0270917, 
6.6388333, 7.6611, 7.50065, 6.3217167, 6.7594417, 8.0401167, 
7.252425, 6.444, 7.77975, 7.9104167, 6.42495, 6.6421667, 6.6103333, 
7.3489417, 7.23205, 6.2059333, 8.726725, 8.994625, 7.2460917, 
8.660125, 5.2502833, 7.2591, 5.6425417, 6.889925, 5.353675, 6.50635, 
6.260675, 7.4236583, 5.9076417, 7.3915, 6.2134917, 7.1645333, 
6.922675, 6.0295417, 6.1687917, 7.2771083, 6.6152333, 6.3299417, 
7.167325, 6.647275, 5.726475, 5.93905, 6.2888583, 6.7497167, 
6.4364083, 5.8906583, 7.6052917, 8.039425, 6.5672833, 7.8754667, 
6.3086333, 5.352025, 7.2849417, 5.7184833, 6.9675917, 5.5615333, 
6.6157917, 6.3505417, 7.4881, 6.0007417, 7.5110583, 6.35525, 
7.254075, 7.0289083, 6.1994417, 6.2860833, 7.372575, 6.735975, 
6.4628917, 7.3102167, 6.8619417, 5.9123667, 6.1611917, 6.4854083, 
6.8942417, 6.563625, 6.0610083, 7.941625, 8.6969167, 6.66075, 
8.1197167, 6.2802, 3.9638, 5.870825, 4.1852, 5.5841417, 4.3007583, 
5.2352167, 4.4281417, 5.819425, 4.1990917, 5.9338917, 4.89765, 
5.7204333, 5.6546833, 4.5632167, 4.9803333, 5.6962417, 5.247725, 
4.7092583, 6.0145417, 5.6403917, 4.4016917, 4.7181, 4.5007833, 
5.2828917, 5.1314167, 4.7492, 6.777575, 6.9040083, 4.9760583, 
6.4471917, 5.0952833, 3.712725, 5.8215333, 4.025725, 5.5635, 
4.2354083, 5.143525, 4.4900083, 5.6802417, 4.1214333, 5.8128, 
4.7525583, 5.6412583, 5.5534917, 4.487475, 4.8237833, 5.6156917, 
5.0573, 4.5755417, 5.8096083, 5.5252083, 4.3145583, 4.5437417, 
4.194675, 5.0100833, 4.8972333, 4.590025, 6.6441417, 6.5789417, 
4.6947667, 6.1648167, 4.8517333, 3.982925, 5.7966833, 4.1607083, 
5.5564833, 4.2557417, 5.2304083, 4.8661333, 5.912875, 4.4988333, 
6.03915, 4.9131583, 5.8518667, 5.6578583, 4.773225, 4.8958583, 
5.8759833, 5.204725, 4.8961667, 5.9217, 5.58395, 4.5410667, 4.73445, 
4.5922333, 5.2517333, 5.0220333, 4.619475, 6.4883667, 6.429175, 
4.6796417, 6.3171083, 4.93615, 3.9278833, 5.7590417, 4.1155667, 
5.612725, 4.2199833, 5.2126667, 4.805275, 5.8888833, 4.4363, 
6.0380083, 4.892, 5.8192083, 5.64205, 4.708825, 4.8751583, 5.833775, 
5.2210417, 4.853225, 5.924225, 5.5856583, 4.5386167, 4.7280917, 
4.5618, 5.264425, 5.03855, 4.5539, 6.4993, 6.4900667, 4.6749083, 
6.2961333, 4.918525, 4.0890583, 6.33385, 4.3470083, 5.9645, 4.6541833, 
5.5438667, 4.9556583, 6.1590583, 4.6379417, 6.2876833, 5.2235167, 
6.1387167, 6.0547583, 4.9545667, 5.254125, 6.05395, 5.4813417, 
4.9971333, 6.2266583, 5.9172833, 4.7275917, 4.9274917, 4.443575, 
5.3164917, 5.2507083, 5.1704583, 7.173075, 6.9351583, 5.0816667, 
6.5568, 5.3417667, 5.1705167, 7.0777833, 5.6253333, 7.231225, 
5.5799167, 6.6942917, 6.1014583, 7.538725, 5.7152667, 7.459275, 
6.2406083, 7.064925, 6.9234417, 5.8328833, 6.1819583, 7.2127583, 
6.8071583, 6.2599417, 7.2975417, 6.973875, 5.804125, 6.1944667, 
6.38855, 7.0553583, 6.8393167, 6.1275417, 7.9986833, 8.5846, 
6.4682167, 8.0134583, 6.1805917, 5.0699583, 6.9006667, 5.36365, 
6.9204917, 5.4478667, 6.5391583, 6.0647417, 7.2951667, 5.6632833, 
7.25595, 6.1057333, 6.9578417, 6.8235583, 5.8671833, 6.0716417, 
7.060175, 6.5401, 6.1229417, 7.1305083, 6.7823417, 5.62415, 5.9202, 
5.9957167, 6.7142167, 6.4706417, 5.9004667, 7.8304583, 8.2144667, 
6.1530583, 7.6896417, 5.9285333, 4.2625417, 5.9677583, 4.58695, 
6.0400083, 4.4215333, 5.6052833, 5.04165, 6.48845, 4.6423583, 
6.1688833, 5.0256167, 5.926725, 5.7214667, 4.746375, 4.9828, 
6.1583083, 5.6903, 5.217375, 6.1341583, 5.7868083, 4.5895333, 
4.98235, 5.159725, 5.7866167, 5.6300833, 4.882975, 6.7210833, 
7.4314833, 5.2493083, 6.8503833, 5.2225583, 3.8417833, 5.9798, 
4.1168583, 5.63415, 4.3311333, 5.0777667, 4.6606833, 5.789425, 
4.3565167, 5.9736167, 4.8910667, 5.9445417, 5.699275, 4.6897167, 
4.9036083, 5.8767, 5.088675, 4.6224417, 5.8052833, 5.5697167, 
4.3237, 4.6084333, 4.2958833, 5.1394417, 5.0137583, 4.7711, 6.771275, 
6.5984417, 4.845625, 6.3338083, 5.1370333, 3.1820167, 5.2699667, 
3.4827167, 5.0992583, 3.7040583, 4.6358583, 4.1604917, 5.2488333, 
3.7522, 5.3774167, 4.2636167, 5.1998167, 5.0456333, 4.051475, 
4.289175, 5.1718917, 4.5787083, 4.1461667, 5.2983167, 5.03025, 
3.8709333, 4.0917167, 3.731925, 4.5584167, 4.4200333, 4.061375, 
6.064225, 6.02975, 4.1590167, 5.6589083, 4.2614833, 3.68695, 
5.587375, 3.91725, 5.3387, 4.0061667, 4.9563833, 4.1942, 5.6720583, 
3.9584333, 5.6873583, 4.6251, 5.4801417, 5.3975583, 4.2382, 4.6710917, 
5.4898083, 5.0469667, 4.4950083, 5.72005, 5.46085, 4.30355, 4.5525917, 
4.3681667, 5.1723167, 5.0331417, 4.4793083, 6.5492917, 6.720225, 
4.7550917, 6.197775, 4.8082917, 4.09925, 5.986525, 4.3104417, 
5.68455, 4.4287167, 5.3555667, 4.5191083, 5.9269833, 4.2695917, 
5.9984167, 4.981225, 5.8049917, 5.7680667, 4.5736667, 5.0673583, 
5.7443583, 5.2811083, 4.719175, 6.0376667, 5.73875, 4.3947333, 
4.8157333, 4.6093417, 5.3906417, 5.2357417, 4.684825, 6.8885583, 
7.018425, 5.0878167, 6.5122333, 5.2084, 3.810525, 6.2600083, 
3.6246583, 5.7396417, 4.0617917, 5.6724583, 4.2505833, 4.7518417, 
4.1232, 6.208375, 4.5881167, 5.252575, 5.71795, 4.0840583, 4.700325, 
6.2360333, 4.701725, 3.922525, 5.5162167, 5.6220333, 3.8836833, 
4.4883667, 4.5398583)), .Names = "x", row.names = c(NA, -962L
), class = "data.frame")

Assuming I want 30 values per interval (the 'n'), here is the code I used:

df$z<-cut(df$x, seq(30,length(df$x),by=30)/length(df$x), include.lowest=T)

Which gives me:

> table(df$z)

[0.0312,0.0624] (0.0624,0.0936]  (0.0936,0.125]   (0.125,0.156]   (0.156,0.187]   (0.187,0.218]   (0.218,0.249]   (0.249,0.281]   (0.281,0.312]   (0.312,0.343]   (0.343,0.374] 
              0               0               0               0               0               0               0               0               0               0               0 
  (0.374,0.405]   (0.405,0.437]   (0.437,0.468]   (0.468,0.499]    (0.499,0.53]    (0.53,0.561]   (0.561,0.593]   (0.593,0.624]   (0.624,0.655]   (0.655,0.686]   (0.686,0.717] 
              0               0               0               0               0               0               0               0               0               0               0 
  (0.717,0.748]    (0.748,0.78]    (0.78,0.811]   (0.811,0.842]   (0.842,0.873]   (0.873,0.904]   (0.904,0.936]   (0.936,0.967]   (0.967,0.998] 
              0               0               0               0               0               0               0               0               0 

What I want is a similar result to what I get with quantiles:

df$zbis<-cut(df$x, quantile(df$x, (0:20)/20), include.lowest=T)
table(df$zbis)

[3.18,4.29] (4.29,4.62] (4.62,4.89] (4.89,5.14] (5.14,5.33] (5.33,5.53] (5.53,5.66]  (5.66,5.8]  (5.8,5.94]  (5.94,6.1]  (6.1,6.26] (6.26,6.45] (6.45,6.58] (6.58,6.74] (6.74,6.93] 
         49          48          48          48          48          48          48          48          48          48          48          48          48          48          48 
(6.93,7.14] (7.14,7.34] (7.34,7.62] (7.62,8.06] (8.06,9.36] 
         48          48          48          48          49 

Except I'd like this to be reproducible for another database, and so I can't use the quantile function, since I would not get the same intervals on a different database.

SECOND EDIT: here is the second sample from another database. 'x' is the same variable, and they have similar ranges.

structure(list(x = c(5.319125, 7.3036667, 5.5166167, 7.0308333, 
5.6812917, 6.5496583, 5.6621833, 6.4682, 5.4897417, 7.185175, 
6.44905, 7.2055833, 7.629375, 6.2282833, 6.6813917, 7.7976, 6.683975, 
5.5089083, 7.307475, 7.3958667, 6.2036583, 6.2488833, 5.9372, 
6.6180167, 6.4167833, 5.640275, 8.7416917, 8.3134167, 6.8996833, 
5.1931083, 7.0688917, 5.4793583, 7.0091583, 5.7593, 7.1053333, 
5.9382583, 7.1765417, 6.003075, 7.7699833, 6.2757333, 7.2446583, 
7.179275, 6.0013083, 6.447975, 7.7845833, 6.9071083, 6.1009, 
7.425425, 7.4619083, 5.9380667, 6.2116, 6.13315, 7.0852, 7.0047417, 
6.0763917, 8.5926583, 8.7468417, 7.2485167, 8.5096833, 5.177275, 
7.09985, 5.6444667, 7.0102417, 5.7303833, 7.0383333, 5.9870583, 
7.3342083, 5.9363667, 7.7753333, 6.38355, 7.389575, 7.0396667, 
5.889625, 6.29395, 7.51135, 6.940925, 6.1455417, 7.4281833, 7.4657167, 
5.9707083, 6.1902083, 6.0936167, 6.9595167, 6.85065, 5.8525, 
8.5148083, 8.805625, 7.00665, 8.4457, 5.3437833, 7.1560417, 5.5748, 
7.4622583, 5.9412417, 7.3428667, 6.2594167, 7.5839167, 6.28685, 
8.0270917, 6.6388333, 7.6611, 7.50065, 6.3217167, 6.7594417, 
8.0401167, 7.252425, 6.444, 7.77975, 7.9104167, 6.42495, 6.6421667, 
6.6103333, 7.3489417, 7.23205, 6.2059333, 8.726725, 8.994625, 
7.2460917, 8.660125, 3.614125, 5.6345917, 3.9410417, 5.2901417, 
4.0147333, 4.766825, 4.4500417, 5.5189, 4.11375, 5.6350667, 4.5756917, 
5.5998833, 5.3663, 4.44405, 4.5767417, 5.552025, 4.847425, 4.4382583, 
5.5769417, 5.2390667, 4.0610917, 4.4054833, 4.1917, 4.9029083, 
4.6935917, 4.3499417, 6.0562333, 6.081225, 4.45855, 6.0121583, 
4.740275, 4.5028, 6.4177833, 4.8716417, 6.1469917, 4.6208917, 
5.7748083, 5.4530083, 6.694125, 5.0944333, 6.5123167, 5.3257083, 
6.2765333, 6.0149167, 5.1815583, 5.30715, 6.4149083, 5.82245, 
5.515425, 6.3654333, 5.8472833, 4.9798917, 5.1833583, 5.5210333, 
6.0410667, 5.7377917, 5.2666083, 7.0378167, 7.744175, 5.718725, 
7.3220583, 5.24325, 5.3256, 7.2155167, 5.696925, 7.0029667, 5.5235, 
6.7261083, 6.2810667, 7.546825, 5.90915, 7.3299167, 6.2227333, 
7.147075, 6.9142417, 6.0012083, 6.1725333, 7.29815, 6.7, 6.3454583, 
7.2129583, 6.7559833, 5.8115, 6.0756667, 6.458225, 6.9969167, 
6.778825, 6.2245833, 8.0809583, 8.875325, 6.7210917, 8.3203, 
6.3513, 5.2591333, 7.1404917, 5.6266417, 6.9356, 5.4568, 6.6604, 
6.206025, 7.48525, 5.8323667, 7.24635, 6.1446583, 7.066275, 6.8334, 
5.9198667, 6.09505, 7.2206583, 6.63085, 6.270075, 7.1397333, 
6.689125, 5.7441333, 6.042575, 6.38255, 6.9325833, 6.7175667, 
6.1592, 8.00415, 8.8051167, 6.647125, 8.2465667, 6.2788167, 6.49435, 
8.1847583, 6.664475, 8.0528583, 6.6822417, 7.376, 7.1517833, 
8.2306833, 6.8584583, 8.3052167, 7.288375, 8.2758583, 7.7162583, 
7.2807833, 7.0459, 8.2507833, 7.5855, 7.0505917, 8.2230167, 8.1669, 
6.8184667, 6.9700583, 7.0936167, 7.7615667, 7.6239083, 7.0921667, 
9.02585, 9.3416167, 7.6256333, 9.0869333, 8.0984667, 4.116325, 
6.1680917, 4.56965, 5.797725, 4.36085, 5.42455, 5.144075, 6.1531833, 
4.77825, 6.2533417, 5.0192083, 5.99395, 5.6934083, 4.9074167, 
4.9823083, 5.9861667, 5.4068833, 5.1872833, 6.10095, 5.659325, 
4.6632833, 4.86315, 5.221775, 5.5878, 5.3217083, 4.8202333, 6.4883083, 
6.69355, 4.952075, 6.7075583, 5.00015, 5.2502833, 7.2591, 5.6425417, 
6.889925, 5.353675, 6.50635, 6.260675, 7.4236583, 5.9076417, 
7.3915, 6.2134917, 7.1645333, 6.922675, 6.0295417, 6.1687917, 
7.2771083, 6.6152333, 6.3299417, 7.167325, 6.647275, 5.726475, 
5.93905, 6.2888583, 6.7497167, 6.4364083, 5.8906583, 7.6052917, 
8.039425, 6.5672833, 7.8754667, 6.3086333, 5.352025, 7.2849417, 
5.7184833, 6.9675917, 5.5615333, 6.6157917, 6.3505417, 7.4881, 
6.0007417, 7.5110583, 6.35525, 7.254075, 7.0289083, 6.1994417, 
6.2860833, 7.372575, 6.735975, 6.4628917, 7.3102167, 6.8619417, 
5.9123667, 6.1611917, 6.4854083, 6.8942417, 6.563625, 6.0610083, 
7.941625, 8.6969167, 6.66075, 8.1197167, 6.2802, 3.9638, 5.870825, 
4.1852, 5.5841417, 4.3007583, 5.2352167, 4.4281417, 5.819425, 
4.1990917, 5.9338917, 4.89765, 5.7204333, 5.6546833, 4.5632167, 
4.9803333, 5.6962417, 5.247725, 4.7092583, 6.0145417, 5.6403917, 
4.4016917, 4.7181, 4.5007833, 5.2828917, 5.1314167, 4.7492, 6.777575, 
6.9040083, 4.9760583, 6.4471917, 5.0952833, 3.712725, 5.8215333, 
4.025725, 5.5635, 4.2354083, 5.143525, 4.4900083, 5.6802417, 
4.1214333, 5.8128, 4.7525583, 5.6412583, 5.5534917, 4.487475, 
4.8237833, 5.6156917, 5.0573, 4.5755417, 5.8096083, 5.5252083, 
4.3145583, 4.5437417, 4.194675, 5.0100833, 4.8972333, 4.590025, 
6.6441417, 6.5789417, 4.6947667, 6.1648167, 4.8517333, 4.1059833, 
5.9023167, 4.2812417, 5.6593917, 4.3587583, 5.3359583, 4.983275, 
6.0223417, 4.6178333, 6.1545333, 5.0244667, 5.9596, 5.7608833, 
4.8875333, 4.9990583, 5.9919333, 5.3157417, 5.0169333, 6.024775, 
5.6717167, 4.6372083, 4.8370583, 4.7311333, 5.3704, 5.133575, 
4.7174917)), .Names = "x", row.names = c(NA, -455L), class = "data.frame")
Chris. Z
  • I'm always amazed that people think putting in `etc` will explain a task when only one instance is offered. – IRTFM May 01 '15 at 17:02
  • I edited the post. The example does not really matter though. Here it just shows how the cut() call would go if the z variable had 4 levels. – Chris. Z May 01 '15 at 17:15
  • The premise of this request, i.e. that there needs to be an equal or similar number of items in a category for t-tests to be valid, is statistically incorrect. The t.test function takes into account the number of items and adjusts for differences. (Furthermore the code is incorrect since the names of the "z" variable would not be "1","2","3", etc.) – IRTFM May 01 '15 at 17:28
  • True, I omitted to add the 'labels=c(1:length(levels(df$z)))' argument. I understand that the t-test is flexible but it does not change my request: is there a way I can control for the size of the populations within the intervals created through the breaks? – Chris. Z May 01 '15 at 17:34
  • I don't understand why using `quantile` with an appropriate argument would not succeed. If the sample size is N and the desired number per group is n, then splitting at probabilities something like `seq(n, N, by=n)/N` should succeed. – IRTFM May 01 '15 at 17:41
  • The problem with quantile is I can't control for the step and thus obtain the same intervals for two different databases... I tried your solution but somehow it gave me an empty table for the z variable. – Chris. Z May 01 '15 at 18:19
  • It's been done; as user Thomas pointed out in his final comments, I think I'd need to write a function that would implement the cuts simultaneously on both databases for a shared variable... Which I cannot do due to my insufficient coding skills. – Chris. Z May 01 '15 at 19:36

2 Answers

Updated after some comments:

Since you state that the minimum number of cases in each group would be fine for you, I'd go with Hmisc::cut2

v <- rnorm(10, 0, 1)
Hmisc::cut2(v, m = 3) # minimum of 3 cases per group

The documentation for cut2 states:

m   desired minimum number of observations in a group.
    The algorithm does not guarantee that all groups will have at least m observations.

The same cuts for separate variables

If the distributions of your variables are very similar you could extract the exact cutpoints by setting the argument onlycuts = T and reuse them for the other variables. In case the distributions are different though, you will end up with few cases in some intervals.

Using your data:

library(magrittr)
library(Hmisc)

# df1 and df2 hold the two samples posted in the question
cuts <- cut2(df1$x, g = 20, onlycuts = TRUE) # determine cuts based on df1

cut2(df1$x, cuts = cuts) %>% table
(cut2(df2$x, cuts = cuts) %>% table) * 2 # multiplied by two for better comparison
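The same idea can be sketched in base R, without Hmisc. This is only an illustration under stated assumptions: `x1` and `x2` are simulated stand-ins for the two databases (not the posted data), and the outermost breaks are widened to -Inf and Inf, which is a choice made here so that values in the second sample lying outside the range of the first still land in an interval instead of becoming NA.

```r
# Simulated stand-ins for the two samples (assumption, not the posted data)
set.seed(42)
x1 <- rnorm(962, mean = 6.2, sd = 1.1)
x2 <- rnorm(455, mean = 6.3, sd = 1.1)

# Breaks are determined from the first sample only (20 quantile groups)
breaks <- quantile(x1, probs = (0:20) / 20, names = FALSE)

# Widen the outer breaks so out-of-range values in x2 still get a bin
breaks[c(1, length(breaks))] <- c(-Inf, Inf)

# Apply the same breaks to both samples
z1 <- cut(x1, breaks)
z2 <- cut(x2, breaks)

# Both factors share the same interval labels; compare counts side by side
counts <- cbind(first = table(z1), second = table(z2))
counts
```

Looking at the `second` column shows directly whether any interval ends up with too few observations once the breaks are reused.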
Thomas K
  • I didn't see anything in the question that implied equal spacing of the original data. – IRTFM May 01 '15 at 17:48
  • @Thomas: Your solution does not allow for control over the size of each interval though. If I apply it to my data, all it does is create the number of intervals corresponding to the specified ratio for the breaks. It's not what I'm after in this case. – Chris. Z May 01 '15 at 18:15
  • @user2092517 then I obviously failed in understanding what your question is about - sorry. – Thomas K May 01 '15 at 18:18
  • I've been doing it by hand (somehow): I calculated a step in order to roughly divide the continuous variable x into 20 intervals. The problem is that I get some empty intervals, or others where the values are not numerous enough to apply the confidence interval formula. As a consequence, I have to delete some breaks in order to get decent-looking intervals, meaning they have more than 30 observations in this case. What I want is to automatically get intervals with a certain number of values, never mind that the step for the interval is not fixed, without having to do it manually. – Chris. Z May 01 '15 at 18:24
  • @user2092517 maybe you could provide us with sample data for `x` and with your desired result `z`. But from your comment I understand: you would like to be able to control for the _minimum_ number of observations, each interval contains. Is that correct? Or do you need to control for the _specific_ number of observation in each interval? – Thomas K May 01 '15 at 18:35
  • Minimal number would be fine! I also want to be able to reproduce the result for two different databases and still get the same intervals, since I'd be comparing the same variable. – Chris. Z May 01 '15 at 18:38
  • It does, thank you! But I realize I may have an unattainable goal: to have exactly the same breaks (albeit with a different number of observations in each interval, which would not matter in my case) for two different databases with different total numbers of observations... – Chris. Z May 01 '15 at 18:54
  • @user2092517 you would need to write a function which determines the cuts, taking into account both variables. That's not in my league... – Thomas K May 01 '15 at 19:33
  • Unfortunately, neither is it in mine. Thank you again for your input! – Chris. Z May 01 '15 at 19:35
  • @user2092517 In case the distributions of the variables are very similar, you could just use the cuts from one variable for the other. but if the distributions are different, you will end up with very few cases in some intervals. – Thomas K May 01 '15 at 19:44
  • Actually it did seem to work very well on my sample! I'm gonna try it on the whole databases and see how it turns out. Thank you very much! – Chris. Z May 01 '15 at 20:05
  • @user2092517 I updated my answer to include this suggestion. Please feel free to accept my post as answered, in case it solved your problem. :) – Thomas K May 01 '15 at 20:20
  • Well it does seem to be doing the trick on the full databases for all the variables. It may need some tweaks here and there, but that's inherent to the data, not the method. Problem solved then! Again, thank you very much! – Chris. Z May 01 '15 at 20:32
  • @user2092517 Although it is for sure not the most elegant solution, it is great that it worked out for you! – Thomas K May 01 '15 at 20:41

This is a good example of how NOT to pose a question. At last we have an example, and it is possible to post code that applies to it. (You apparently naively pasted the exact code in my comment without thinking about how to express 'n' and 'N' in the context of the problem.) I did need to add prob=c( seq(...) , 1) in order to capture the highest values.

This assumes that you want groups of size 100 (although it is still very unclear why this is needed).

 x$xct <- cut( x$x, breaks=quantile(x$x, prob=c( seq(100, length(x$x), by=100)/length(x$x) , 1) ))
 table(x$xct)

(4.64,5.17] (5.17,5.57] (5.57,5.85] (5.85,6.17] (6.17,6.51] (6.51,6.85] 
        100         100         100         100         100         100 
(6.85,7.26] (7.26,7.94] (7.94,9.36] 
        100         100          62 
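One thing to watch for: the counts in the table above add up to 862 of the 962 values, because the lowest break is the quantile at n/N, so values below it fall outside every interval and come back as NA. A minimal sketch of a variant that also keeps the lowest group follows; it assumes simulated data in place of the posted sample, and adding 0 to the probabilities together with `include.lowest = TRUE` is a suggestion made here, not part of the original call.

```r
# Simulated stand-in for the 962 posted values (assumption)
set.seed(7)
x <- rnorm(962, mean = 6.2, sd = 1.1)

n <- 100          # desired group size
N <- length(x)    # sample size

# Probabilities 0, n/N, 2n/N, ..., 1 so that the breaks span the whole range
probs <- unique(c(0, seq(n, N, by = n) / N, 1))

# include.lowest = TRUE keeps the minimum value inside the first interval
xct <- cut(x, breaks = quantile(x, probs = probs), include.lowest = TRUE)

table(xct)        # group sizes
sum(is.na(xct))   # number of values left without an interval
```

With N not a multiple of n, the last group simply collects the remainder, as in the output above.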
IRTFM
  • Thank you. Obviously I am not that proficient with R, so I really appreciate you taking the time to help me. However, I did realize that this only partially solves my conundrum: I'd need to be able to obtain the same breaks for a different database (on the same variable), in order to be able to carry out comparisons. I added a sample of this second database, if you'd want to take a look. It would mean computing the breaks simultaneously for both databases. – Chris. Z May 01 '15 at 19:55
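
For readers who need the breaks computed simultaneously for both databases, here is one possible sketch, not a tested solution to the original data. It assumes simulated samples, and `shared_breaks`, `m` and `n_candidates` are hypothetical names introduced for illustration. The idea is greedy: take candidate cut points from the quantiles of the pooled values, and keep a candidate only once the interval it closes holds at least `m` observations in each sample.

```r
# Hypothetical helper: common breaks with at least m observations per
# interval in BOTH samples (greedy merge of pooled-quantile candidates)
shared_breaks <- function(x1, x2, m = 30, n_candidates = 50) {
  # Candidate interior cut points: quantiles of the pooled values
  probs <- seq(0, 1, length.out = n_candidates + 1)
  cand <- unique(quantile(c(x1, x2), probs = probs, names = FALSE))
  cand <- cand[-c(1, length(cand))]   # drop the pooled minimum and maximum

  kept <- -Inf
  for (b in cand) {
    lo <- kept[length(kept)]
    # Accept b only if (lo, b] is populated enough in both samples
    enough <- sum(x1 > lo & x1 <= b) >= m && sum(x2 > lo & x2 <= b) >= m
    if (enough) kept <- c(kept, b)
  }

  # If the top interval is too small, merge it into its neighbour
  lo <- kept[length(kept)]
  if (length(kept) > 1 && (sum(x1 > lo) < m || sum(x2 > lo) < m)) {
    kept <- kept[-length(kept)]
  }
  c(kept, Inf)
}

# Simulated stand-ins for the two databases (assumption)
set.seed(123)
x1 <- rnorm(962, mean = 6.2, sd = 1.1)
x2 <- rnorm(455, mean = 6.3, sd = 1.2)

br <- shared_breaks(x1, x2, m = 30)
z1 <- cut(x1, br)
z2 <- cut(x2, br)

min(table(z1))   # smallest interval in the first sample
min(table(z2))   # smallest interval in the second sample
```

The outer breaks are -Inf and Inf so that no value is dropped; the price is that the first and last labels are open-ended. A regular grid of candidate cut points could be swapped in for the pooled quantiles if rounder interval limits are preferred.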