
My aim is to compare, in a pivot table, whether there is a link between the presence of one particular shop and the population density of the areas where these shops can be found. For that, I have a CSV file with 600 examples of areas where the shop is or is not present. It's a file with 600 lines and two columns: 1/ a number representing the population density of an area, and 2/ the quantity of this particular shop in that area (0, 1 or 2).
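To give an idea of the layout, the first lines of the file look something like this (the name of the second column and these particular shop counts are just illustrative):

densit_pop2;shops
14,9;0
16,7;1
17,3;0
18,3;2
20,2;0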

In order to build this pivot table, I need to group the densities into 10 groups of 60 lines each (the first group holding the 60 largest densities, down to the last group with the 60 smallest). Then I'll easily be able to see how many shops are built where the density is low versus high. Is that understandable (I hope)? :)

Nothing really difficult, I suppose. But there are so many possible ways (and packages) that could work for this... that I'm a little bit lost.

My main issue: what is the simplest way to group my variable into ten groups of 60 lines each? I've tried cut()/cut2() and hist() without success, and I've heard about bin_var() and reshape(), but I don't understand how they could help in this case.


For example (as Justin asked), here is what I get with cut():

data <- read.csv("data.csv", sep = ";")
groups <- cut(as.numeric(data$densit_pop2), breaks=10)
summary(groups)
(0.492,51.4]   (51.4,102]    (102,153]    (153,204]    (204,255]    (255,306] 
      53           53           52           52           52           54 
(306,357]    (357,408]    (408,459]    (459,510] 
      52           59           53           54 

OK, good: 'groups' does indeed contain 10 groups with almost the same number of lines. But some of the values shown in the intervals don't make any sense to me. Here are the first lines of the density column (sorted in increasing order):

> head(data$densit_pop2)
[1] 14,9 16,7 17,3 18,3 20,2 20,5
509 Levels: 100 1013,2 102,4 102,6 10328 103,6 10375 10396,8 104,2 ... 99,9

I mean, look at the first group. Why 0.492 when 14.9 is my smallest value? And if I count by hand how many lines fall between the first one and the value 51.4, I find 76. Why does it report 53 lines? Note that the data frame is correctly sorted from lowest to highest.

I'm certainly missing something... but what?

jonathan
  • Can you share some of what you have tried, as in the actual code? How exactly did `cut` and `hist` not work? Do you really want equal numbers in each group, or an even distribution across the range? – Justin Sep 10 '13 at 22:40
  • You can use the [percentile](http://stackoverflow.com/questions/7165683/where-is-the-percentile-function-in-cran-r) – Metrics Sep 11 '13 at 00:46
  • @Justin Very good question! I *really* need equal numbers in each group. That's exactly what I failed to do with `cut` and `hist`... – jonathan Sep 11 '13 at 07:14
  • @Metrics Sorry, I'm not familiar with this notion. Can you elaborate? :) – jonathan Sep 11 '13 at 08:25
  • I think at least part of the problem is that you didn't tell R that you are using commas as your decimal points. Using `read.csv2` will help. Because the decimal separator wasn't set, your density column was read as a factor, and you are attempting to convert that factor directly to a numeric variable in `cut`. That's why the output of `cut` doesn't cover the range of your variable (a small illustration follows these comments). – aosmith Sep 11 '13 at 15:21
  • @aosmith Wonderful! You're completely right! Thanks a lot for that... :) – jonathan Sep 11 '13 at 19:16
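To illustrate aosmith's point: as.numeric() on a factor returns the underlying level codes (1, 2, 3, ...), not the original values, which is why cut() saw values running from roughly 1 to 509 (the number of levels) instead of 14.9 to 10396.8. A minimal sketch with made-up values:

# a column with decimal commas ends up as a factor when read naively
x <- factor(c("14,9", "16,7", "102,4"))

as.numeric(x)
# [1] 2 3 1        <- the alphabetical level codes, not the densities

as.numeric(sub(",", ".", as.character(x)))
# [1]  14.9  16.7 102.4   <- the actual values, once the comma is replaced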

2 Answers


I think you'll be happy with cut2 once you have a numeric variable to work with. When your file uses commas as the decimal separator, use read.csv2 or pass the argument dec = "," when reading in the dataset.
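For example, a minimal sketch of the reading step (assuming the file is called data.csv, with ";" as the field separator and "," as the decimal mark):

# read.csv2 defaults to sep = ";" and dec = ",", so the density column comes in numeric
data <- read.csv2("data.csv")

# equivalent, spelling the arguments out
data <- read.csv("data.csv", sep = ";", dec = ",")

str(data$densit_pop2)   # should now report "num", not a factor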

With a numeric variable in hand (here simply simulated to span roughly the range of your densities):

y <- runif(600, 14.9, 10396.8)

require(Hmisc)
summary(cut2(y, m = 60))   # m = 60 asks for groups of about 60 observations each

You can do the same thing with cut, but you would need to set your breaks at the appropriate quantiles to get equal groups which takes a bit more work.

# same idea with cut(): use the deciles of y as the break points
summary(cut(y, breaks = quantile(y, probs = seq(0, 1, 1/10)),
            include.lowest = TRUE))
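Once each row has a group, the comparison you describe is just a cross-tabulation of density group against shop count. A sketch, assuming your (correctly read) data frame holds the densities in densit_pop2 and the shop count in a column called shops (adjust that name to whatever it really is):

groups <- cut2(data$densit_pop2, m = 60)   # 10 groups of ~60 rows each
table(groups, data$shops)                  # rows: density groups; columns: 0/1/2 shops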
aosmith

Responding to your data: you first need to correct the decimal commas so the densities can be treated as numbers:

# replace the decimal comma with a period, then convert the text to numeric
data$densit_pop3 <- as.numeric(
                       sub('\\,', '.',
                           as.character(data$densit_pop2)))

Then, something along these lines (assuming this is not really a question about loading data from text files):

# summary of the density values for each shop count ('dfrm', 'dens', 'shops' are placeholder names)
with(dfrm, by(dens, factor(shops), summary))

As an example of the output one might get:

with(BNP, by( proBNP.A, Sex, summary))

Sex: Female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    5.0    55.7   103.6   167.9   193.6  5488.0 3094899 
--------------------------------------------------------------------- 
Sex: Male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      5      30      63     133     129    5651 4013760 

If you are trying to plot this to look at the density of densities (which in this case seems like a reasonable request) then try this:

require(lattice)
densityplot(~ dens | shops, data = dfrm)   # one density panel per shop count

(And please do stop calling these "pivot tables". That is an aggregation strategy from Excel and one should really learn to describe the desired output in standard statistical or mathematical jargon.)

IRTFM
  • OK, I reproduced it; it seems useful for sorting and selecting the data. But (sorry) I don't understand how I can make my 10 groups of 60 lines with that. I don't want to select data, I want to sort a column in increasing order (OK, easy) and pile the rows up into 10 groups to obtain 10 lines. Hard to explain simply, and sorry to insist. And OK, I'll stop using the ugly "pivot tables", it was just a translation convenience... and thanks for this first answer. ;) – jonathan Sep 11 '13 at 07:51