I am working on a data frame df
which is as below:
Input:
TUserId SUID mid_sum final_sum
115 201 2 7
115 309 1 8
115 404 1 9
209 245 2 10
209 398 2 10
209 510 2 10
209 602 1 10
371 111 2 11
371 115 1 11
371 123 3 11
371 124 2 11
1- My data is arranged in a wide
format, where each row has a unique student ID shown as SUID
.
2- Several students can have the same teacher and hence the common teacher ID across multiple rows shown as TUserId
.
3- The data includes student scores in mid-terms and then students' final scores.
4- I am interested in finding out if there are any instances where a teacher who gave similar scores to their students on mid-terms as shown by mid_sum
gave inconsistent scores on their final exams as shown by final_sum
. If such inconsistency is found in data, I want to add a column Status
that records this inconsistency
.
Requirement:
a- For this, my rule is that if mid_sum
and final_sum
are sorted in ascending order, as I have done in this example data frame df
. I want to identify the cases where the ascending sequence breaks in either of these columns mid_sum
and final_sum
.
b- Can it be done, if the data is not sorted?
Example 1:
For example, for SUID = 309
, mid_sum
is a decrement from the previous mid_sum
. So it should be marked as inconsistent
. It should only happen for students who were marked by the same teacher TUserId
, which in this case is 115
.
Example 2:
Similarly, for SUID = 602
, mid_sum
is a decrement from the previous mid_sum
. So it should be marked as inconsistent
. Again, it is for the same teacher TUserId = 209
To elaborate further, I want an output like this:
Output:
TUserId SUID mid_sum final_sum Status
115 201 2 7 consistent
115 309 1 8 inconsistent
115 404 1 9 consistent
209 245 2 10 consistent
209 398 2 10 consistent
209 510 2 10 consistent
209 602 1 10 inconsistent
371 111 2 11 consistent
371 115 1 11 inconsistent
371 123 3 11 consistent
371 124 2 11 inconsistent
Data import dput()
The dput()
for the data frame is below:
dput(df)
structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L),
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L),
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L),
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)),
class = "data.frame", row.names = c(NA, -11L))
I looked for similar questions on SO and found this R - identify consecutive sequences but it does not seem to help me address my question. Another related post was Determine when a sequence of numbers has been broken in R but again, it does not help in my case.
Any advice on how to solve this problem would be greatly appreciated.
Thanks!