I am looking for a more efficient method of re-coding column entries in a dataframe, where the recoding is conditional on the entries in other columns.

Take this simple example, which demonstrates my current procedure of creating a new column for the recoded data, converting it to character, and then using the subset square brackets to recode the data (is there an official name for this procedure?).

## example data frame
df = data.frame( id = seq( 1 , 100 , by=1 ) ,
                 x = rep( c("W", "Z") , each=50),
                 y = c( rep( c("A","B","C","D") , 25 ) ) )

# add a new column based on column y; convert to character 
df$newY = as.character( df$y ) 

# change newY entries to numbers based on conditions in other columns
df$newY[ df$x == "W" & df$newY == "B" ] <- 1
df$newY[ df$x == "Z" & df$newY == "D" ] <- 3

This procedure is fine for recoding variables with a small number of conditions, but it becomes cumbersome when there are many conditions or many distinct variables to recode.

Could anyone help me with finding a more efficient method of doing this?

Thanks!

user3237820
  • Is there some kind of logic or pattern in the recoding? By efficiency, do you mean a method that requires less typing or do you mean faster performance / memory efficiency? – talat Feb 24 '16 at 10:17
  • how many conditions are we talking? – CuriousBeing Feb 24 '16 at 10:18
  • Would something like this solve your problem: `df$newY = as.factor( paste0(df$y, df$x) ) ; as.numeric(df$newY)` – Raad Feb 24 '16 at 10:21
  • @MaxPD In my data, recoding is conditional on one other column in the dataframe (as in the example) but there are up to four multiples of the same variable needing to be converted to the same new coding (e.g. imagine if there was A1, A2, A3 etc. in the above example needing to be recoded to 1, conditional on "W"). There are also 8 distinct variables in the conditional column (e.g. "x" column above), and up to 11 different variables in the y column, meaning 8 blocks of ~ 11 lines of recoding. I hope that is clear. – user3237820 Feb 24 '16 at 10:29
  • @docendodiscimus By efficiency, I do mean just less typing, sorry. For patterning, it's difficult. For instance, a variable, e.g. A, conditional on Z, may have to be recoded as 1, but A conditional on W needs recoding as 2. Perhaps a better approach would be to reshape the data frame from long to wide format, and recode each variable as a separate column... – user3237820 Feb 24 '16 at 10:31
  • Well, that's life. If there are that many conditions, so be it; you have to code them. – CuriousBeing Feb 24 '16 at 10:32
  • @NBATrends Thank you, although this solution returns a column combining two conditions (e.g. AW , AZ). For my example, I require recoding to numbers (although I may be missing the versatility of your solution!). – user3237820 Feb 24 '16 at 10:34
  • @MaxPD Yep, that's what I was thinking! Thanks for your help. – user3237820 Feb 24 '16 at 10:35
  • The call to as.numeric would do that for you in the second line of my proposition – Raad Feb 24 '16 at 10:37

1 Answer

Some approaches to this:

df <- data.frame(id = seq( 1 , 100 , by=1 ) ,
                 x = rep( c("W", "Z") , each=50),
                 y = c( rep( c("A","B","C","D") , 25)))

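# Take the product (my preference). Convert with as.factor() first: since R 4.0
# data.frame() keeps strings as character, so as.numeric() alone returns NA.
# Note that products can collide (W/B and Z/A both give 2).
as.numeric(as.factor(df$x)) * as.numeric(as.factor(df$y))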

# Create new factor based on x and y and convert to numeric
as.numeric(as.factor(paste0(df$x, df$y)))
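
If the codes follow no pattern (as described in the comments, where A under Z might be 1 but A under W might be 2), a lookup table may save typing: each rule becomes one row rather than one assignment line. The following is only a sketch using base R's match(); the `rules` data frame and its `code` column are hypothetical names introduced here, and the two rules are simply the ones from the question.

```r
# Example data from the question
df <- data.frame(id = 1:100,
                 x  = rep(c("W", "Z"), each = 50),
                 y  = rep(c("A", "B", "C", "D"), 25),
                 stringsAsFactors = FALSE)

# Hypothetical rule table: one row per (x, y) pair that should be recoded
rules <- data.frame(x    = c("W", "Z"),
                    y    = c("B", "D"),
                    code = c("1", "3"),
                    stringsAsFactors = FALSE)

# Look up each row of df in the rule table using a combined x/y key;
# rows with no matching rule get NA
idx <- match(paste(df$x, df$y), paste(rules$x, rules$y))

# Use the rule's code where one matched, otherwise keep the original y
df$newY <- ifelse(is.na(idx), df$y, rules$code[idx])
```

As in the question's own procedure, newY stays a character column, because the unmatched rows still hold letters. Adding a condition then means adding a row to `rules`, which could also be read from a CSV file if there are many.
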
Raad