-1

I have data like the below:

# Create fake data frame
score <- rep(seq(1:3), 2)
id <- rep(c(2014, 2015), each = 3)
var_if_1 <- rep(c(0.1, 0.8), each = 3)
var_if_2 <- rep(c(0.9, 0.7), each = 3)
var_if_3 <- rep(c(0.6, 0.2), each = 3)
data.frame(score, id, var_if_1, var_if_2, var_if_3)

More specifically, each row is uniquely defined by two vectors in a data frame (e.g. score and id) and there are a multitude of additional columns that begin with a string (e.g. "var_if_") and end with a different number (e.g. 1,2,3). Furthermore, for a given value of score (i.e. for any row with a given score) the value of the additional variables does not vary.

I am trying to convert these data into a data frame like the below:

# Desired output data frame
score <- rep(seq(1:3), 2)
id <- rep(c(2014, 2015), each = 3)
var <- c(0.1, 0.9, 0.6, 0.8, 0.7, 0.2)
data.frame(score, id, var)

More specifically, the additional variables (var_if_#) are removed and aggregated into a single new variable (e.g. var) which takes on the value of one of the additional variable columns based on the value of score. For example, if score == 2, then var == var_if_2.

Constrains on the solution

  • Looking to use base R or dplyr().
  • Looking for a solution that generalizes to a large number of values of 'score' and corresponding columns for 'var_if_#' and rows of arbitrary ordering.

The below exemplifies the arbitrary row ordering.

score <- rep(seq(1:3), 2)
id <- rep(c(2014, 2015), each = 3)
var_if_1 <- rep(c(0.1, 0.8), each = 3)
var_if_2 <- rep(c(0.9, 0.7), each = 3)
var_if_3 <- rep(c(0.6, 0.2), each = 3)
foo <- data.frame(score, id, var_if_1, var_if_2, var_if_3)
foo[sample(1:nrow(foo)), ]  # arbitrary row order

I am also aware that I could just use ifelse() but this becomes tedious with many possible values of score (unless there is a looping approach that can reduce the tedium).

socialscientist
  • 3,759
  • 5
  • 23
  • 58

2 Answers2

4

Use matrix indexing, which avoids slow looping or apply logic:

cbind(dat[1:2], var=dat[3:5][cbind(seq_len(nrow(dat)), dat$score)])
#  score   id var
#1     1 2014 0.1
#2     2 2014 0.9
#3     3 2014 0.6
#4     1 2015 0.8
#5     2 2015 0.7
#6     3 2015 0.2

If you are specifically matching on name patterns like var_if_1 etc, then use match to get the columns to extract:

dat[cbind( seq_len(nrow(dat)), match(paste0("var_if_", dat$score), names(dat)))]
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Any advice on doing this when I do not have the variables in a specific order? The 1:2 and 3:5 presumes 1:2 are arbitrary values, but if the variables of interest are randomly distributed across a df this solution does not work without first arrange()ing the variables. – socialscientist Jul 21 '16 at 06:06
  • @user3614648 - see my edit - it's not a big change. – thelatemail Jul 21 '16 at 06:10
  • Looks good. Is there a reason other than speed that you use 'dat[3:5]' instead of just 'dat'? Also, the solution is not a duplicate insofar as you offered a way to generalize what you provided in the cited previous case. – socialscientist Jul 21 '16 at 06:16
  • @user3614648 - not really, no, just cutting down on not potentially working with a large dataset. – thelatemail Jul 21 '16 at 06:17
  • This solution is not quite general enough upon reflection. It assumes the rows are in a specific order. In this case, we need to order the rows via: foo <- dplyr::arrange(foo, score, id). Please edit your answer and also unmark this post as a duplicate since it is NOT a duplicate. – socialscientist Jul 21 '16 at 17:03
  • @user3614648 it assumes nothing about the row order as far as I can tell. Inside the data frame each value of the id is matched to the same row. The seq_len just essentially loops over each row. – thelatemail Jul 21 '16 at 20:32
2

You can use the function apply, that will iterate over each row of your data frame. If the columns are in a specific order like in your example:

var <- apply(my_data_frame, 1, function(x) { x[x["score"] + 2] })

If you want to use the name of the column instead of their positions you could replace x["score"] + 2] by x[paste0("var_if_",x["score"])]

user1470500
  • 652
  • 5
  • 14
  • The logic makes sense. I think there might be an issue with the second part of the code (which is what I am most interested in). Won't the first "x[" before the paste0 require quotes around the results of paste0("var_if_",x["score"])? I am trying this and it is not working for all rows in my data (it only works for rows where score is 2+ digits) – socialscientist Jul 21 '16 at 05:52