Identify cases where data sequence changes based on other column UserIDs

Question

I am working on a data frame df which is as below:

Input:

TUserId  SUID   mid_sum final_sum
 115      201   2       7     
 115      309   1       8     
 115      404   1       9           
 209      245   2       10    
 209      398   2       10          
 209      510   2       10
 209      602   1       10
 371      111   2       11
 371      115   1       11
 371      123   3       11
 371      124   2       11

1- My data is arranged in a wide format, where each row has a unique student ID shown as SUID.

2- Several students can have the same teacher and hence the common teacher ID across multiple rows shown as TUserId.

3- The data includes student scores in mid-terms and then students' final scores.

4- I am interested in finding out if there are any instances where a teacher who gave similar scores to their students on mid-terms as shown by mid_sum gave inconsistent scores on their final exams as shown by final_sum. If such inconsistency is found in data, I want to add a column Status that records this inconsistency.

Requirement:

a- For this, my rule is that if mid_sum and final_sum are sorted in ascending order, as I have done in this example data frame df. I want to identify the cases where the ascending sequence breaks in either of these columns mid_sum and final_sum.

b- Can it be done, if the data is not sorted?

Example 1:

For example, for SUID = 309, mid_sum is a decrement from the previous mid_sum. So it should be marked as inconsistent. It should only happen for students who were marked by the same teacher TUserId, which in this case is 115.

Example 2:

Similarly, for SUID = 602, mid_sum is a decrement from the previous mid_sum. So it should be marked as inconsistent. Again, it is for the same teacher TUserId = 209

To elaborate further, I want an output like this:

Output:

TUserId  SUID   mid_sum final_sum   Status
 115      201   2       7           consistent
 115      309   1       8           inconsistent
 115      404   1       9           consistent
 209      245   2       10          consistent
 209      398   2       10          consistent
 209      510   2       10          consistent
 209      602   1       10          inconsistent
 371      111   2       11          consistent
 371      115   1       11          inconsistent
 371      123   3       11          consistent
 371      124   2       11          inconsistent

Data import dput()

The dput() for the data frame is below:

dput(df)

structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L), 
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)), 
class = "data.frame", row.names = c(NA, -11L))

I looked for similar questions on SO and found this R - identify consecutive sequences but it does not seem to help me address my question. Another related post was Determine when a sequence of numbers has been broken in R but again, it does not help in my case.

Any advice on how to solve this problem would be greatly appreciated.

Thanks!

score 2 · Answer 1 · answered Jun 15 '21 at 03:42

Here's a fairly straightforward way where we test the sign of the lagged difference. If the mid_sum difference sign is the same as the final_sum difference sign, they are "consistent".

library(dplyr)
df %>%
  arrange(TUserId, final_sum) %>%
  group_by(TUserId) %>%
  mutate(
    Status = if_else(
      sign(final_sum + 0.1 - lag(final_sum, default = 0)) == sign(mid_sum + 0.1 - lag(mid_sum, default = 0)),
      "consisent", "inconsistent"
    )
  )
# # A tibble: 11 x 5
# # Groups:   TUserId [3]
#    TUserId  SUID mid_sum final_sum Status      
#      <int> <int>   <int>     <int> <chr>       
#  1     115   201       2         7 consisent   
#  2     115   309       1         8 inconsistent
#  3     115   404       1         9 consisent   
#  4     209   245       2        10 consisent   
#  5     209   398       2        10 consisent   
#  6     209   510       2        10 consisent   
#  7     209   602       1        10 inconsistent
#  8     371   111       2        11 consisent   
#  9     371   115       1        11 inconsistent
# 10     371   123       3        11 consisent   
# 11     371   124       2        11 inconsistent

The + .1 serves to make rows where the scores stay the same count as a positive sign.

Thank you @Gregor Thomas, this seems to be working perfectly on my data. I will accept it as my solution. — Sandy, Jun 15 '21 at 10:50
Hi @Gregor Thomas, a quick question: Is it possible to modify your code so that the row number ```10``` from your results becomes ```inconsistent```? It is with reference to the row where ```TUserId = 371```, ```SUID = 123```, ```mid_sum = 3``` and ```final_sum=11```? Could you please advise on this? — Sandy, Jun 16 '21 at 07:13
Can you explain the logic? In row 3 `mid_sum` stays the same and `final_sum` increases - this is marked "consistent". Is that correct? But when the reverse happens in row 10, `mid_sum` increases by `final_sum` stays the same, you want that marked as inconsistent? But also if both stay the same as in row 5, that is consistent? Is that right? — Gregor Thomas, Jun 16 '21 at 13:16
Thanks for your quick reply. 'Yes' to all of your questions (for rows # 3, 5 and 10). — Sandy, Jun 16 '21 at 13:31

AnilGoyal · Accepted Answer · 2021-06-21T05:33:47.863

Perhaps accumulate family of functions has been designed for these situations. Using accumulate2 here -

As first argument I am passing through mid_sum
second argument is lagged value i.e. lag(mid_sum) with default as any value except NA and actual values it may take. I am taking 0 as safe
.init is provided with any value. I chose c only.
if first argument (..2) [..1 is accumulated value and not first arg] is less than ..3 i.e. second argument, return inconsistent else consistent.
Now since .init is provided the results will be one value large than provided, so stripped its first value [-1]

df <- structure(list(
  TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
  SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
  mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L), 
  final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)), 
  class = "data.frame", row.names = c(NA, -11L))

library(tidyverse)

df %>%
  arrange(TUserId, final_sum) %>%
  group_by(TUserId) %>%
  mutate(status = unlist(accumulate2(mid_sum, lag(mid_sum, default = 0), .init = 'c', 
                              ~ if(..2 < ..3) 'inconsistent' else 'consistent')[-1]))
#> # A tibble: 11 x 5
#> # Groups:   TUserId [3]
#>    TUserId  SUID mid_sum final_sum status      
#>      <int> <int>   <int>     <int> <chr>       
#>  1     115   201       2         7 consistent  
#>  2     115   309       1         8 inconsistent
#>  3     115   404       1         9 consistent  
#>  4     209   245       2        10 consistent  
#>  5     209   398       2        10 consistent  
#>  6     209   510       2        10 consistent  
#>  7     209   602       1        10 inconsistent
#>  8     371   111       2        11 consistent  
#>  9     371   115       1        11 inconsistent
#> 10     371   123       3        11 consistent  
#> 11     371   124       2        11 inconsistent

^{Created on 2021-06-15 by the reprex package (v2.0.0)}

Thanks @AnilGoyal, I tried this on my original data frame. R shows little error signs ```x``` at the start and the end of the mutate statement within the RStudio environment. I am wondering why is that the case? — Sandy, Jun 15 '21 at 10:33
Can I upload an image/snapshot of my screen, is there a way to do that? — Sandy, Jun 15 '21 at 10:37
When I run the command, it does not print any error messages but just on the screen there are little ```x```'s. — Sandy, Jun 15 '21 at 10:38
you have to upload that on imgur/drive and then post a link in comments — AnilGoyal, Jun 15 '21 at 10:38