0

I have a dataframe like so:

rel      <- c(2, 5, NA, 3, 6)
year.in  <- c(4, NA, 2, 3, 2)
year.out <- c(6, 7, NA, 5, 4)
year.1   <- c(NA, NA, NA, NA, NA)
year.2   <- c(NA, NA, NA, NA, NA)
year.3   <- c(NA, NA, NA, NA, NA)
year.4   <- c(NA, NA, NA, NA, NA)
year.5   <- c(NA, NA, NA, NA, NA) 

df <- as.data.frame(cbind(rel, year.in, year.out, year.1, year.2, year.3, 
                      year.4, year.5))

What I would like to do is update the missing values in year.1 - year.5 with the value of 'rel', but only if: (year.in >= year.i AND year.out <= year.i) (with i is 1:5)

Focussing on the just the year of entry, I came up with this:

for (i in 1:5) ifelse(df$year.in < i,
    df[paste("year", i, sep= ".")]<- NA,
    df[paste("year", i, sep= ".")]<- df["rel"])

But this merely replaces all year.i variables with the value of rel.

I have two questions:

  • how can I update the year.i variables with the 'rel' values on the conditions mentioned?

  • is it bad to use the if else statement here?

Best and thanks in advance,

Richard

Richard
  • 1,224
  • 3
  • 16
  • 32

2 Answers2

4
library(data.table)
dt = data.table(df)

for(i in 1:5) dt[year.in <= i & i <= year.out, paste0('year.', i) := rel]

dt
#   rel year.in year.out year.1 year.2 year.3 year.4 year.5
#1:   2       4        6     NA     NA     NA      2      2
#2:   5      NA        7     NA     NA     NA     NA     NA
#3:  NA       2       NA     NA     NA     NA     NA     NA
#4:   3       3        5     NA     NA      3      3      3
#5:   6       2        4     NA      6      6      6     NA
eddi
  • 49,088
  • 6
  • 104
  • 155
1

I'd melt your data using the reshape2 package:

library(reshape2)    
df.melt <- melt(df, id.vars=c('rel', 'year.in', 'year.out'))

Dig out the numeric year:

df.melt$year <- as.integer(gsub('year\\.', '', df.melt$variable))

Then use vectorized operations:

subsetter <- with(df.melt, year.in >= year & year.out <= year.out)
subsetter[is.na(subsetter)] <- FALSE
df.melt$value[subsetter] <- df.melt$rel[subsetter]

However, in your example, everything fails your condition.

Using ifelse is perfectly acceptable, however, don't do assignment inside. Instead assign its result to something, like below. The issue was that the two assignments you were performing inside the ifelse were not on each subset but instead acting as though they were just run independently.

for (i in 1:5) {
  year_col <- paste('year', i, sep='.')
  df[[year_col]] <- ifelse(df$year.in >= i & df$year.out <= i,
                         df$rel,
                         df[[year_col]])
}

To answer your bullets:

  • See above.

  • There is nothing specifically wrong with using ifelse and sometimes it is handy to do so for readability. However, it is a "looping" construct and thus can often be replaced by a more efficient vectorized solution.

Justin
  • 42,475
  • 9
  • 93
  • 111
  • Dear Justin, thank you for explicitly addressing my questions. I am sorry to say, that I have difficulty with running your code. The first piece of code works fine, but does not provide the expected output. e.g. a value of 2 in year 1, while year.in == 4 and year.out == 6. The for loop produces warnings in my case: Warning messages: "In `[<-.data.frame`(`*tmp*`, year_col, value = list(c(NA_real_, : provided 5 variables to replace 1 variables" – Richard Jul 16 '13 at 07:56
  • 1
    Thanks for the note, my mistake. The unexpected output is because you have your condition written backwards in your question. However, the warning was unintentional and is fixed by using `[[` to index the `df` by `year_col` (see my edit). – Justin Jul 16 '13 at 13:32
  • Thanks for getting back on this. Yes I was inconsistent in the condition statement :[ – Richard Jul 17 '13 at 16:19