Calculate Levenshtein/Hamming distance by grouping variable

Question

I am trying to calculate the accuracy of participants' responses (column MEM_Response) against the correct responses (column MEM_Correct). The grouping variable is the participant's ID (column SERIAL, with 15 cases per participant).

dput(example)
structure(list(MEM_Correct = c("ZLHK", "RZKX", "DGWL", "BCJSP", 
"WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB", "DSHRKBV", "HCXLZWB", 
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX", 
"DGWL", "BCJSP", "WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB", 
"DSHRKBV", "HCXLZWB", "HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD"
), MEM_Response = c("ZLHK", "RZKX", "DGWL", "BCJSP", "WRKLTJ", 
"CHBXS", "HNDCWX", "SWVDTN", "WLDGPB", "DSHRKBV", "HCXLZWB", 
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX", 
"DGWL", "BCJSB", "WRKTJ", "CHBXA", "HDNDWX", "SWVNDT", "WLGPBD", 
"DSHKRBV", "WLGJHKK", "HDBNVZC", "BCHRKVBM", "RVGBKSNM", "NWHVZWHJ"
), SERIAL = c("4444", "4444", "4444", "4444", "4444", "4444", 
"4444", "4444", "4444", "4444", "4444", "4444", "4444", "4444", 
"4444", "5555", "5555", "5555", "5555", "5555", "5555", "5555", 
"5555", "5555", "5555", "5555", "5555", "5555", "5555", "5555"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 
26L, 27L, 28L, 29L, 30L, 31L), class = "data.frame")

I tried to calculate the accuracy (i.e. the distance between the correct and the actual response) using multiple methods, but so far none has given a satisfactory result.

Using stringdist for Hamming & Levenshtein distance:

Levenshtein:

example$MEM_Lev = stringdist(example$MEM_Correct, example$MEM_Response, method = c("lv"))

Hamming:

example$MEM_Ham = stringdist(example$MEM_Correct, example$MEM_Response, method = c("hamming"))

Problem: I have the Hamming distance for each case, but how would I go about calculating the accuracy per participant, eventually ending up with a range between 0 and 1 (i.e. 0 and 100% accuracy)? Another problem with the Hamming distance is that cases of different lengths (see row 5: WRKTJ vs. WRKLTJ) yield Inf. So I would probably be better off using the Levenshtein distance, is that right?
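For illustration, the difference between the two metrics on rows 5 and 8 can be sketched in base R. `utils::adist()` is swapped in here for `stringdist(..., method = "lv")` so that the snippet runs without extra packages. `hamming()` and `lev()` are hypothetical helpers, and `hamming()` is written to mirror the Inf reported above for strings of unequal length.

```r
# Hypothetical helper: Hamming distance counts mismatched positions,
# so it is only defined for strings of equal length.
hamming <- function(a, b) {
  if (nchar(a) != nchar(b)) return(Inf)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hypothetical helper: Levenshtein distance via base R's adist()
lev <- function(a, b) adist(a, b)[1, 1]

hamming("WRKTJ", "WRKLTJ")    # Inf: lengths differ
lev("WRKTJ", "WRKLTJ")        # 1: a single inserted "L"

hamming("SWVNDT", "SWVDTN")   # 3: the last three positions all mismatch
lev("SWVNDT", "SWVDTN")       # 2: delete "N", then append "N"
```

Levenshtein stays defined when a participant types too many or too few letters, which is why it may fit free-length responses better than Hamming.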

I then tried the with() function for the Levenshtein distance:

with(example, levenshteinSim(MEM_Correct, MEM_Response))

This time, the values lie between 0 and 1, which is one step forward, I think. Take row 5 again: WRKTJ (5 letters) vs. WRKLTJ (6 letters) differ in that the latter has an extra "L" right in the middle. So 1 single edit (in this case deletion) would be necessary to match with the correct response. Its Levenshtein value of 0.8333 corresponds to 5/6 correct (even though the correct value only has 5). Am I using the right distance function?
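The 0.8333 can be reproduced by hand: it equals one minus the edit distance divided by the length of the longer string, here 1 - 1/6. Below is a minimal sketch of that arithmetic in base R, with `adist()` standing in for the package functions. `sim_by_correct` is a hypothetical alternative that divides by the length of the correct answer instead; that is a methodological choice, not something taken from either package.

```r
correct  <- "WRKTJ"
response <- "WRKLTJ"

# Levenshtein distance between the two strings
d <- adist(correct, response)[1, 1]

# Normalise by the longer of the two strings
sim_by_longest <- 1 - d / max(nchar(correct), nchar(response))

# Hypothetical alternative: normalise by the correct answer's length,
# floored at 0 because very long responses could push it below zero
sim_by_correct <- max(0, 1 - d / nchar(correct))

round(sim_by_longest, 4)   # 0.8333
sim_by_correct             # 0.8
```

Both versions lie between 0 and 1; they differ only in how strongly extra letters in the response are penalised.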

And finally, my last question is:

How do I match/calculate the mean accuracy per participant? I have another df with all participants, and I want to merge the per-person means from the example into that data frame, where 1 row = 1 participant.

I hope this makes sense - If not, I can try to include more information. Please feel free to suggest other methods if you believe I am not using the correct approach.

Thank you in advance!

Don’t use `c(…)` with a single element, it doesn’t make sense. — Konrad Rudolph, Jun 25 '19 at 11:21
"Am I using the right distance function?" I think this is central to your question, and it's impossible to answer without more information. I cannot imagine a non-contrived situation where (1) the user input and the correct answer are strings, and (2) the *individual letters of that string matter/are meaningful*, and (3) the length of the string is not fixed by the format. — Mees de Vries, Jun 25 '19 at 11:22
@MeesdeVries it is an actual real-life example. The task is to memorise a series of letters that were shown individually in the correct order (see `MEM_Correct` column). Between the presentation of the letters, participants need to solve a simple equation. Their actual response - after presenting N letters and N equations - is shown in the `MEM_Response` column. I want to calculate the distance between them for a mean accuracy score. So yes, the letters are a string and the length of their response was not limited/fixed. — annedroid, Jun 25 '19 at 11:33

AkselA · Answer 1 · 2019-06-25T12:59:50.737

How you want to define 'accuracy' is a methodological decision that has to be up to you. There might be some references in the literature, but here is one suggestion.

library(stringdist)

example$lv.dist <- stringdist(example[,1], example[,2], method="lv")
head(example)
#   MEM_Correct MEM_Response SERIAL lv.dist
# 1        ZLHK         ZLHK   4444       0
# 2        RZKX         RZKX   4444       0
# 3        DGWL         DGWL   4444       0
# 4       BCJSP        BCJSP   4444       0
# 5       WRKTJ       WRKLTJ   4444       1
# 6       CHBXS        CHBXS   4444       0

aggregate(lv.dist ~ SERIAL, example, mean)
#   SERIAL  lv.dist
# 1   4444 0.200000
# 2   5555 1.866667

aggregate(lv.dist ~ SERIAL, example, function(x) round(mean(100/(1+x)), 2))
#   SERIAL lv.dist
# 1   4444   92.22
# 2   5555   54.17

# Using stringsim()
example$lv.sim <- stringsim(example[,1], example[,2], method="lv")

(agg <- aggregate(lv.sim ~ SERIAL, example, function(x) round(mean(x)*100, 2)))
#   SERIAL lv.sim
# 1   4444  96.67
# 2   5555  73.25

# Merging two data.frames is easy as long as they have a
# column in common (SERIAL in this case)
participants <- data.frame(age=7:9, SERIAL=c(5555, 4444, 1234))

merge(participants, agg)
#   SERIAL age lv.sim
# 1   4444   8  96.67
# 2   5555   7  73.25

merge(participants, agg, all=TRUE)
#   SERIAL age lv.sim
# 1   1234   9     NA
# 2   4444   8  96.67
# 3   5555   7  73.25
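The steps above (row-wise similarity, mean per participant, merge on SERIAL) can also be sketched end to end in base R. This is only a sketch under stated assumptions: `resp` is a four-row stand-in built from values in the question, `adist()` replaces `stringdist()`/`stringsim()`, and the key column is converted to character on both sides so the merge does not rely on type coercion.

```r
# Small stand-in for `example` (values taken from the question's data)
resp <- data.frame(
  MEM_Correct  = c("ZLHK", "WRKTJ", "ZLHK", "BCJSP"),
  MEM_Response = c("ZLHK", "WRKLTJ", "ZLHK", "BCJSB"),
  SERIAL       = c("4444", "4444", "5555", "5555"),
  stringsAsFactors = FALSE
)

# Row-wise Levenshtein distance: mapply() pairs element i with element i
resp$lv.dist <- mapply(function(a, b) adist(a, b)[1, 1],
                       resp$MEM_Correct, resp$MEM_Response,
                       USE.NAMES = FALSE)

# Similarity in [0, 1], normalised by the longer string of each pair
resp$lv.sim <- 1 - resp$lv.dist /
  pmax(nchar(resp$MEM_Correct), nchar(resp$MEM_Response))

# Mean similarity per participant
agg <- aggregate(lv.sim ~ SERIAL, resp, mean)

# Participant-level table; make the key the same type as in `resp`
participants <- data.frame(age = 7:9, SERIAL = c(5555, 4444, 1234))
participants$SERIAL <- as.character(participants$SERIAL)

# all.x = TRUE keeps participants without any responses (lv.sim is NA)
merged <- merge(participants, agg, by = "SERIAL", all.x = TRUE)
merged
```

Using `all.x = TRUE` rather than `all = TRUE` keeps exactly one row per participant in the participant table, which matches the "1 row = 1 participant" layout described in the question.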

Thank you very much! This works on my example data, but when I run it with the rest of the data it only calculates the lv.sim and lv.dist for 13 rows. Do I need to specify anything? And if I want to merge the final aggregated lv.sim with another df, would that work as well (matched by Serial)? — annedroid, Jun 25 '19 at 12:07
@annedroid: It's hard to tell what's going wrong with your code. I suggest you go through it step by step and see where it goes wrong. I added an example of how you can merge two data.frames — AkselA, Jun 25 '19 at 12:51
