2

I am working on a data frame df which is as below:

Input:
TUserId  SUID   mid_sum final_sum
 115      201   2       7     
 115      309   1       8     
 115      404   1       9           
 209      245   2       10    
 209      398   2       10          
 209      510   2       10
 209      602   1       10
 371      111   2       11
 371      115   1       11
 371      123   3       11
 371      124   2       11

1- My data is arranged in a wide format, where each row has a unique student ID shown as SUID.

2- Several students can have the same teacher and hence the common teacher ID across multiple rows shown as TUserId.

3- The data includes student scores in mid-terms and then students' final scores.

4- I am interested in finding out if there are any instances where a teacher who gave similar scores to their students on mid-terms as shown by mid_sum gave inconsistent scores on their final exams as shown by final_sum. If such inconsistency is found in data, I want to add a column Status that records this inconsistency.

Requirement:

a- For this, my rule is that if mid_sum and final_sum are sorted in ascending order, as I have done in this example data frame df. I want to identify the cases where the ascending sequence breaks in either of these columns mid_sum and final_sum.

b- Can it be done, if the data is not sorted?

Example 1:

For example, for SUID = 309, mid_sum is a decrement from the previous mid_sum. So it should be marked as inconsistent. It should only happen for students who were marked by the same teacher TUserId, which in this case is 115.

Example 2:

Similarly, for SUID = 602, mid_sum is a decrement from the previous mid_sum. So it should be marked as inconsistent. Again, it is for the same teacher TUserId = 209

To elaborate further, I want an output like this:

Output:
TUserId  SUID   mid_sum final_sum   Status
 115      201   2       7           consistent
 115      309   1       8           inconsistent
 115      404   1       9           consistent
 209      245   2       10          consistent
 209      398   2       10          consistent
 209      510   2       10          consistent
 209      602   1       10          inconsistent
 371      111   2       11          consistent
 371      115   1       11          inconsistent
 371      123   3       11          consistent
 371      124   2       11          inconsistent
 

Data import dput()

The dput() for the data frame is below:

dput(df)

structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L), 
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)), 
class = "data.frame", row.names = c(NA, -11L))

I looked for similar questions on SO and found this R - identify consecutive sequences but it does not seem to help me address my question. Another related post was Determine when a sequence of numbers has been broken in R but again, it does not help in my case.

Any advice on how to solve this problem would be greatly appreciated.

Thanks!

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
Sandy
  • 1,100
  • 10
  • 18

2 Answers2

2

Here's a fairly straightforward way where we test the sign of the lagged difference. If the mid_sum difference sign is the same as the final_sum difference sign, they are "consistent".

library(dplyr)
df %>%
  arrange(TUserId, final_sum) %>%
  group_by(TUserId) %>%
  mutate(
    Status = if_else(
      sign(final_sum + 0.1 - lag(final_sum, default = 0)) == sign(mid_sum + 0.1 - lag(mid_sum, default = 0)),
      "consisent", "inconsistent"
    )
  )
# # A tibble: 11 x 5
# # Groups:   TUserId [3]
#    TUserId  SUID mid_sum final_sum Status      
#      <int> <int>   <int>     <int> <chr>       
#  1     115   201       2         7 consisent   
#  2     115   309       1         8 inconsistent
#  3     115   404       1         9 consisent   
#  4     209   245       2        10 consisent   
#  5     209   398       2        10 consisent   
#  6     209   510       2        10 consisent   
#  7     209   602       1        10 inconsistent
#  8     371   111       2        11 consisent   
#  9     371   115       1        11 inconsistent
# 10     371   123       3        11 consisent   
# 11     371   124       2        11 inconsistent

The + .1 serves to make rows where the scores stay the same count as a positive sign.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thank you @Gregor Thomas, this seems to be working perfectly on my data. I will accept it as my solution. – Sandy Jun 15 '21 at 10:50
  • Hi @Gregor Thomas, a quick question: Is it possible to modify your code so that the row number ```10``` from your results becomes ```inconsistent```? It is with reference to the row where ```TUserId = 371```, ```SUID = 123```, ```mid_sum = 3``` and ```final_sum=11```? Could you please advise on this? – Sandy Jun 16 '21 at 07:13
  • Can you explain the logic? In row 3 `mid_sum` stays the same and `final_sum` increases - this is marked "consistent". Is that correct? But when the reverse happens in row 10, `mid_sum` increases by `final_sum` stays the same, you want that marked as inconsistent? But also if both stay the same as in row 5, that is consistent? Is that right? – Gregor Thomas Jun 16 '21 at 13:16
  • Thanks for your quick reply. 'Yes' to all of your questions (for rows # 3, 5 and 10). – Sandy Jun 16 '21 at 13:31
1

Perhaps accumulate family of functions has been designed for these situations. Using accumulate2 here -

  • As first argument I am passing through mid_sum
  • second argument is lagged value i.e. lag(mid_sum) with default as any value except NA and actual values it may take. I am taking 0 as safe
  • .init is provided with any value. I chose c only.
  • if first argument (..2) [..1 is accumulated value and not first arg] is less than ..3 i.e. second argument, return inconsistent else consistent.
  • Now since .init is provided the results will be one value large than provided, so stripped its first value [-1]
df <- structure(list(
  TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
  SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
  mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L), 
  final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)), 
  class = "data.frame", row.names = c(NA, -11L))

library(tidyverse)

df %>%
  arrange(TUserId, final_sum) %>%
  group_by(TUserId) %>%
  mutate(status = unlist(accumulate2(mid_sum, lag(mid_sum, default = 0), .init = 'c', 
                              ~ if(..2 < ..3) 'inconsistent' else 'consistent')[-1]))
#> # A tibble: 11 x 5
#> # Groups:   TUserId [3]
#>    TUserId  SUID mid_sum final_sum status      
#>      <int> <int>   <int>     <int> <chr>       
#>  1     115   201       2         7 consistent  
#>  2     115   309       1         8 inconsistent
#>  3     115   404       1         9 consistent  
#>  4     209   245       2        10 consistent  
#>  5     209   398       2        10 consistent  
#>  6     209   510       2        10 consistent  
#>  7     209   602       1        10 inconsistent
#>  8     371   111       2        11 consistent  
#>  9     371   115       1        11 inconsistent
#> 10     371   123       3        11 consistent  
#> 11     371   124       2        11 inconsistent

Created on 2021-06-15 by the reprex package (v2.0.0)

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45