I have a numeric field in a data frame, monthly income, whose values range from INR 15000 to INR 60000.

I want a new field, say income_group, which holds a number corresponding to a range of income: less than 15000 is 1, more than 15000 but less than 30000 is 2, and so on.

One approach is to use nested ifelse statements, like this:

mydataframe$incomegp <- ifelse(monthincome_condition, assign_number, 
                               ifelse statement and so on)

But I have around 7 different numbers for these ranges, so I am looking for a more elegant solution. Also, the numbers used for classification are not sequential, e.g. 1, 3, 5, 7, 9, 12, 15.

I am new to R. Can somebody please suggest some alternatives that don't require nesting?

An example would be great and will help me.

Poptimist

1 Answer

The following piece of code uses cut to split a vector of data into 4 categories (5 breaks), using a built-in R dataset as an example:

with(mtcars, cut(mpg, seq(min(mpg) * 0.99, 
                          max(mpg) * 1.01, 
                          length.out = 5)))

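Note that I add the * 0.99 and * 1.01 because cut uses intervals that are open on the left and closed on the right by default. If you set the breaks to the min and max of the data itself, data equal to the lowest break (or to the highest break, with right = FALSE) will be marked as NA. Passing include.lowest = TRUE is another way to avoid this.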
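A minimal sketch of that boundary behaviour, assuming cut's defaults (right = TRUE, include.lowest = FALSE). The vector x is made up purely for illustration. With these defaults it is the minimum that ends up as NA, and include.lowest = TRUE is shown as an alternative to padding the range:

```r
# Hypothetical data, chosen only to illustrate the boundary behaviour
x <- c(10, 15, 20, 25, 30)
breaks <- seq(min(x), max(x), length.out = 3)   # 10, 20, 30

# Default intervals are (10,20] and (20,30]: the minimum falls outside
default_groups <- cut(x, breaks)
is.na(default_groups)
# [1]  TRUE FALSE FALSE FALSE FALSE

# include.lowest = TRUE closes the first interval: [10,20], (20,30]
closed_groups <- cut(x, breaks, include.lowest = TRUE)
anyNA(closed_groups)
# [1] FALSE
```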

If you know your breaks in advance, you can simply specify them manually in a vector (c(break_value1, break_value2, etc)) instead of generating them on the fly using seq.
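Applied to the income question, a sketch with manual breaks might look like the following. The income values, the break points and the codes are all hypothetical, and treating a boundary value such as 15000 as part of the higher group (right = FALSE) is an assumption, since the question does not say where boundary values belong. Using labels = FALSE makes cut return the interval index, which can then be used to look up non-sequential codes:

```r
# Hypothetical incomes and break points; adjust both to the real data
monthincome <- c(12000, 15000, 22000, 30000, 41000, 60000)
breaks <- c(-Inf, 15000, 30000, 45000, Inf)
codes  <- c(1, 3, 5, 7)   # one code per interval, not sequential

# right = FALSE gives intervals [a, b), so 15000 lands in the second group
grp <- cut(monthincome, breaks = breaks, labels = FALSE, right = FALSE)

# grp is the interval index (1..4); use it to pick the matching code
income_group <- codes[grp]
income_group
# [1] 1 3 3 5 5 7
```

The result can be assigned to a data frame column in the usual way, e.g. mydataframe$incomegp <- codes[grp].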

Paul Hiemstra
  • I am trying this cut example with aaa <- c(1,2,3,4,5,6,7,8,9,10). I am calling cut(aaa,c(0.9,2.9,5.9),labels=c("A","B","C")). What I want is to label 1,2 as A, 3,4,5 as B and the rest as C, but it throws the error "labels/breaks length conflict". As I am specifying three breaks and three labels, why am I getting this error? I think this needs to be answered to resolve my original question. – Poptimist Jun 19 '13 at 12:52
  • Remember that three breaks lead to two intervals and therefore two labels, <1,2> and <2,3>, and that your example leaves a number of `NA` values for everything above the last break. This works: cut(1:10, c(0.9, 2.9, 5.9), labels = c('A','B')). – Paul Hiemstra Jun 19 '13 at 15:29
  • The solution and explanation are indeed to the point and helpful. Thanks. – Poptimist Jun 19 '13 at 16:12
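For the labelling asked about in the comments (1,2 as A; 3,4,5 as B; the rest as C), one possible sketch is to add a fourth break so that three intervals exist. Using Inf as the upper break is an assumption made here to catch "the rest"; it is not something stated in the thread:

```r
aaa <- 1:10

# Four break points define three intervals, so three labels fit
res <- cut(aaa, breaks = c(0.9, 2.9, 5.9, Inf), labels = c("A", "B", "C"))
as.character(res)
# [1] "A" "A" "B" "B" "B" "C" "C" "C" "C" "C"
```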