Conditional replacement of values in dataframe with NA

Question

Not new to R, but I'm new to more advanced R techniques and I've run into an issue. I have a somewhat large dataset I'm working with (not honking big, but about 65000 rows of data total incorporating 18 trials). Link here: https://www.dropbox.com/s/qn6fldj9z6w21b2/wtvstyr%20%282%29.csv?dl=0, and I've been working with it as a dataframe. Here is the task at hand:

I need to conditionally replace velocity values based on information from the direction and Y columns on a trial by trial basis. Here are my conditions: if direction is TRUE and the first 5 values of Y are <20, I need to replace all velocity values for Trial x with NA. If direction is TRUE and the first 5 values of Y are not <20, then I only need to do it on a case-by-case basis. If direction is FALSE and the first 5 values of Y are >180, I need to replace all velocity values for Trial x with NA. If direction is FALSE and the first 5 values of Y are not >180, then I only need to do it on a case-by-case basis.

I have the following code using dplyr from a few solutions that I've found on here (mainly from dplyr replacing na values in a column based on multiple conditions):

wtvstyr <- wtvstyr %>% 
  mutate(velocity = case_when(direction == TRUE & Y<20 ~ NA_real_, TRUE ~ velocity))
wtvstyr <- wtvstyr %>%
  mutate(velocity = case_when(direction == FALSE & Y>180 ~ NA_real_, TRUE ~ velocity))

Which solves my problem on the case-by-case basis. As for discarding entire trials, I am rather stumped. I tried to do it with ifelse wrapped in a dplyr pipeline with an index for the first value, but I must confess I have no idea what I'm doing. Here is that bit of code for the TRUE/<20 conditional along these lines: Using If/Else on a data frame:

wtvstyr %>%
  group_by(Trial) %>%
  ifelse(case_when(direction == TRUE & Y[1]<20), velocity, NA_real_)

When I tried that, however, I got an unused argument error for NA.

Any help would be appreciated! And if there's a better way to do this entirely (re, masking values or some other way I don't know), any guidance would be fantastic. Thanks!

EDIT

Here is a reproducible mini-example of my dataset:

require(tidyverse)

set.seed(80)

Trial <- c(rep(1, 40), rep(2, 40))
Y <- c(sample(0:200, 80, replace=TRUE))
Time <- c(1:80)
Direction1 <- c(rep("TRUE", 10), rep("FALSE", 10))
Direction <- c(rep(Direction1, 4))
example <- data.frame(Trial, Time, Y, Direction)

example$Y2 = example$Y 

shift <- function(x, n){
  c(x[-(seq(n))], rep(NA, n))
}

example$Y2 <- shift(example$Y2, 1)
example$velocity <- as.numeric(example$Y2) - as.numeric(example$Y)
example <- example[-c(5)]

#bit of code to remove velocities when they meet conditions I don't want:
example <- example %>% 
  mutate(velocity = case_when(Direction == TRUE & Y<20 ~ NA_real_, TRUE ~ velocity))
example <- example %>%
  mutate(velocity = case_when(Direction == FALSE & Y>180 ~ NA_real_, TRUE ~ velocity))

With that second bit of code I can remove my case-by-case values (I hope this example clarifies what I mean). I'm still having trouble coding some kind of way to identify based on the first five values in Y which trials need to be discarded entirely.

So for example, in the first subsection of data where Trial==1 and Direction==TRUE, if any of the first five points of data within that subsection are <20, I need to discard all values in that section while Direction==TRUE. In my original dataset, Direction==TRUE and Direction==FALSE repeat a number of times. I need to treat each case separately.

In my set.seed that I have, the first five Y values under Trial==1 and Direction==TRUE are 138, 40, 32, 192 and 99. Here, because no values are <20 I want to keep that trial and simply remove any values thereafter that meet those conditions (as done by the code above). However, when Trial==1 and Direction==FALSE, my values are 34, 187, 53, 79 and 8. Because 187>180, I need to remove all the values corresponding to Trial==1 and Direction==FALSE. However, later on, there is another case where Trial=1 and Direction==FALSE. I want to keep that case separately and evaluate it based on the first five values. If I need to attach another column numbering what repetition of direction I'm on to keep them separated, I can do that.

Let me know if you need any more clarification and again, thank you for any help you can give.

If you're trying to keep/discard rows, try dplyr::slice or dplyr::filter. — Simon Woodward, Nov 18 '19 at 22:45
Can you create a small reproducible example and share data here using `dput` and show expected output based on that ? Please read here on how to give a [reproducible example](http://stackoverflow.com/questions/5963269). — Ronak Shah, Nov 18 '19 at 23:52
By saying "If direction is TRUE and the first 5 values of Y are not <20, then I only need to do it on a case-by-case basis.", what do you mean? Can you clarify this point? — jazzurro, Nov 19 '19 at 01:50
One more thing. I am looking into your data now. For Trial 1, I see both TRUE and FALSE. When you say "direction == TRUE", does this mean all data points in direction have TRUE or FALSE? Otherwise, what is the exact condition you have. I think you want to describe your situation more. — jazzurro, Nov 19 '19 at 01:57

score 1 · Answer 1 · answered Nov 20 '19 at 02:11

If I've gathered roughly what you're looking for, the easiest way to do this is to create a special column flagging the rows you want to keep regardless of your other conditions, and to set those manually in a case_when. After that, you can group_by Trial and Direction and set up a filter to select only those Trial/Direction groups that qualify (where none of the first five values in that group is smaller than 20 or greater than 180, depending on Direction, or where the row is otherwise a special case). From there, you could slice to get the top 5, but in case you want the special rows too, I've used filter instead.

example %>%
  mutate(Direction= as.logical(Direction)) %>% 
  mutate(is.special = case_when(
    Trial== 1 & Direction == FALSE & Y == 30 ~ TRUE,
    TRUE ~ FALSE ## This is a weird convention, but TRUE is the fallback that catches any row where nothing above evaluated TRUE, and in this case we want those rows to be FALSE
  )) %>% 
  group_by(Trial, Direction) %>% 
  filter(
    is.special |
    (Direction == TRUE & !any(Y[1:5] < 20)) |
    (Direction == FALSE & !any(Y[1:5] > 180))
    ) %>% 
  filter(
    is.special | row_number() <= 5
  )

any is a nice function that checks whether at least one member of the group meets the condition. Since I'm negating it, you could use all with the opposite comparisons instead, but I wanted to use the signs you had above to keep things consistent.
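
If you want to blank out velocity with NA rather than drop rows, and to treat each repetition of Direction separately, here is a rough sketch of one way that might work. Note that grouping by Trial and Direction pools every stretch with the same Direction within a trial, so the sketch adds a run counter first. The toy data and the column names run, bad_start and bad_row are made up for illustration; this is not tested against your real file.

```r
library(dplyr)

# Hypothetical toy data: one trial, Direction flips every 6 rows
toy <- data.frame(
  Trial     = rep(1, 24),
  Direction = rep(c(TRUE, FALSE, TRUE, FALSE), each = 6),
  Y         = c(138, 40, 32, 192, 99, 10,     # TRUE run: first five fine, row 6 is < 20
                34, 187, 53, 79, 8, 50,       # FALSE run: 187 > 180 within first five
                5, 60, 70, 80, 90, 100,       # TRUE run: 5 < 20 within first five
                100, 110, 120, 130, 140, 190) # FALSE run: first five fine, row 6 is > 180
)
toy$velocity <- c(diff(toy$Y), NA)

result <- toy %>%
  group_by(Trial) %>%
  # run (assumed name) goes up by one each time Direction flips within a trial
  mutate(run = cumsum(Direction != lag(Direction, default = first(Direction)))) %>%
  group_by(Trial, run) %>%
  mutate(
    # TRUE for the whole run if any of its first five Y values breaks the rule
    bad_start = if (first(Direction)) any(head(Y, 5) < 20) else any(head(Y, 5) > 180),
    # the case-by-case rule from the question
    bad_row   = (Direction & Y < 20) | (!Direction & Y > 180),
    velocity  = if_else(bad_start | bad_row, NA_real_, velocity)
  ) %>%
  ungroup()
```

Using head(Y, 5) instead of Y[1:5] avoids padding with NA when a run has fewer than five rows. If Direction is stored as text or a factor, convert it with as.logical first, as in the code above.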

Thanks for the help, GenesRus! I've been working with the code you posted and I have some questions: why do you specify in the first case_when that Y==30? And secondly, would it be easier to group_by Direction if I assigned a label variable to Direction (ie, True1, False1, True2, False2, etc.) — Molly Westbrook, Nov 20 '19 at 21:54
Oh, just to pick out a specific one that would otherwise be filtered. It sounded like you needed a special case like that. — GenesRus, Nov 21 '19 at 17:01
Direction is fine as it is, I think. It's better not to fill up your tibble with data that's elsewhere, imo. It's a bit weird it was treating it as a factor, but that's easily fixed and may not be necessary in your original data. You can group_by as many variables as you want, so I'd leave it unless you'd add to the information content somehow with the new Direction. — GenesRus, Nov 21 '19 at 17:04

score 1 · Accepted Answer · answered Dec 03 '19 at 20:47

Using the code that GenesRus handed me, I was able to modify the code to select the trials that I want:

trialdata_filter <- trialdata %>%
  mutate(direction= as.logical(direction)) %>% 
  mutate(is.special = case_when(direction == FALSE & Y > 180 ~ TRUE, direction == TRUE & Y <20 ~ TRUE, TRUE ~ FALSE)) %>% 
  group_by(bartrial) %>%
  filter(!any(is.special[1:25] == TRUE))

Thanks for the help!
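
One caveat for anyone reusing this: indexing with [1:25] pads the result with NA if a group has fewer than 25 rows, and any() then returns NA instead of FALSE when no flag is TRUE, so filter() would drop that group. A small illustration with a made-up vector (flags is a hypothetical name), under the assumption that short groups can occur in your data:

```r
flags <- c(FALSE, FALSE, FALSE)  # a group with only 3 rows

short_index <- any(flags[1:5])      # NA: positions 4 and 5 do not exist
short_head  <- any(head(flags, 5))  # FALSE: head() stops at the end of the vector
```

Swapping is.special[1:25] for head(is.special, 25) should give the same result for full-length groups and avoid the NA for short ones.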
