I am trying to calculate the accuracy of participants' responses (column MEM_Response) based on the correct responses (column MEM_Correct). The grouping variable is the participant's ID (column SERIAL, with 15 cases per participant).
dput(example)
structure(list(MEM_Correct = c("ZLHK", "RZKX", "DGWL", "BCJSP",
"WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB", "DSHRKBV", "HCXLZWB",
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX",
"DGWL", "BCJSP", "WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB",
"DSHRKBV", "HCXLZWB", "HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD"
), MEM_Response = c("ZLHK", "RZKX", "DGWL", "BCJSP", "WRKLTJ",
"CHBXS", "HNDCWX", "SWVDTN", "WLDGPB", "DSHRKBV", "HCXLZWB",
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX",
"DGWL", "BCJSB", "WRKTJ", "CHBXA", "HDNDWX", "SWVNDT", "WLGPBD",
"DSHKRBV", "WLGJHKK", "HDBNVZC", "BCHRKVBM", "RVGBKSNM", "NWHVZWHJ"
), SERIAL = c("4444", "4444", "4444", "4444", "4444", "4444",
"4444", "4444", "4444", "4444", "4444", "4444", "4444", "4444",
"4444", "5555", "5555", "5555", "5555", "5555", "5555", "5555",
"5555", "5555", "5555", "5555", "5555", "5555", "5555", "5555"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L,
26L, 27L, 28L, 29L, 30L, 31L), class = "data.frame")
I tried to calculate the accuracy (i.e. the distance between the correct and the actual response) using several methods, but I have not obtained a satisfactory result so far.
Using stringdist() for the Hamming and Levenshtein distances:
Levenshtein:
library(stringdist)
example$MEM_Lev <- stringdist(example$MEM_Correct, example$MEM_Response, method = "lv")
Hamming:
example$MEM_Ham <- stringdist(example$MEM_Correct, example$MEM_Response, method = "hamming")
Problem: I have the Hamming distance for each case, but how would I calculate the accuracy per participant, ending up with a value between 0 and 1 (i.e. 0% to 100% accuracy)? Another problem with the Hamming distance is that pairs of different lengths (see row 5: WRKTJ vs. WRKLTJ) yield Inf. So I would probably be better off using the Levenshtein distance, is that right?
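To make the contrast concrete, here is a minimal sketch on three pairs taken from the example data. It assumes the stringdist package is installed, and the choice to divide by the longer string's length is only one possible normalisation, not necessarily the right one for this task.

```r
library(stringdist)

# Three correct/response pairs taken from the example data
correct  <- c("ZLHK", "WRKTJ", "SWVNDT")
response <- c("ZLHK", "WRKLTJ", "SWVDTN")

# Hamming distance is only defined for equal lengths; unequal pairs give Inf
ham <- stringdist(correct, response, method = "hamming")

# Levenshtein distance also handles insertions and deletions
lev <- stringdist(correct, response, method = "lv")

# Assumed normalisation: divide by the longer string, so the score is in [0, 1]
acc <- 1 - lev / pmax(nchar(correct), nchar(response))
print(data.frame(correct, response, ham, lev, acc))
```

As far as I can tell, stringsim(correct, response, method = "lv") from the same package returns this score directly.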
I then tried levenshteinSim() inside with() for a Levenshtein-based similarity:
with(example, levenshteinSim(MEM_Correct, MEM_Response))
This time, the values lie between 0 and 1, which is one step forward, I think. Take row 5 again: WRKTJ (5 letters) vs. WRKLTJ (6 letters) differ in that the latter has an extra "L" in the middle. So a single edit (here, a deletion) would be necessary to match the correct response. Its similarity value of 0.8333 corresponds to 5/6, i.e. 1 - 1/6, so the denominator appears to be the length of the longer string, even though the correct response has only 5 letters. Am I using the right distance function?
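The arithmetic behind the 0.8333 can be checked with base R alone. The sketch below uses adist() from base R (swapped in here for levenshteinSim() so that no extra package is needed) and compares two normalisations. Dividing by the length of the correct response is a hypothetical alternative, not something either package does by default as far as I know.

```r
# Row 5 of the example data
correct  <- "WRKTJ"
response <- "WRKLTJ"

# adist() returns a matrix of edit (Levenshtein) distances; here it is 1 x 1
d <- as.vector(adist(correct, response))

# Normalised by the longer of the two strings (6 letters)
sim_longer <- 1 - d / max(nchar(correct), nchar(response))

# Hypothetical alternative: normalised by the correct response (5 letters),
# floored at 0 because very long responses could push the value below 0
sim_correct <- max(0, 1 - d / nchar(correct))

print(c(distance = d, sim_longer = sim_longer, sim_correct = sim_correct))
```

The first normalisation stays within 0 and 1 by construction. The second penalises every error relative to the target length, which may or may not be what is wanted for a memory task.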
And finally, my last question is:
How do I calculate the mean accuracy per participant? I have another df with all participants, and I want to merge the per-person means from the example into that data frame, where 1 row = 1 participant.
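One possible shape for this step, as a base R sketch on four rows from the example data. The per-row score uses the longer-string normalisation as an assumption. The names trials, participants and MEM_Acc are hypothetical, and participants stands in for the real one-row-per-person data frame.

```r
# Four rows taken from the example data (two per participant)
trials <- data.frame(
  SERIAL       = c("4444", "4444", "5555", "5555"),
  MEM_Correct  = c("WRKTJ", "SWVNDT", "BCJSP", "CHBXS"),
  MEM_Response = c("WRKLTJ", "SWVDTN", "BCJSB", "CHBXA"),
  stringsAsFactors = FALSE
)

# Row-wise Levenshtein distance with base R adist()
lev <- mapply(function(a, b) as.vector(adist(a, b)),
              trials$MEM_Correct, trials$MEM_Response, USE.NAMES = FALSE)

# Assumed per-row accuracy: 1 - distance / length of the longer string
trials$MEM_Acc <- 1 - lev / pmax(nchar(trials$MEM_Correct),
                                 nchar(trials$MEM_Response))

# Mean accuracy per participant
per_person <- aggregate(MEM_Acc ~ SERIAL, data = trials, FUN = mean)

# Hypothetical participant-level data frame (1 row = 1 participant)
participants <- data.frame(SERIAL = c("4444", "5555"), stringsAsFactors = FALSE)

# Left join, so participants without any trials would get NA
merged <- merge(participants, per_person, by = "SERIAL", all.x = TRUE)
print(merged)
```

Using all.x = TRUE keeps every row of the participant data frame even if someone has no trials in the example data.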
I hope this makes sense. If not, I can include more information. Please feel free to suggest other methods if you believe I am not using the correct approach.
Thank you in advance!