Best string compare algorithm for catching suspiciously similar code

Question

I'm trying to implement a system for code execution, and I am looking for a way to catch suspiciously similar code submitted by different users. My idea is to use Dice's coefficient to compare the submitted strings. Is it OK to use it for my case, and if not, are there better algorithms?

Is there a particular language you are targeting? If there is one specific language, converting into an abstract syntax tree and comparing logic would probably give better results — Theo Walton, May 08 '19 at 18:45
There are multiple software packages with this capability; why reinvent the wheel? — juvian, May 08 '19 at 19:01
Possible duplicate of [How would you code an anti plagiarism site?](https://stackoverflow.com/questions/1085048/how-would-you-code-an-anti-plagiarism-site) — nice_dev, May 08 '19 at 19:06
[This](https://www.plagiarism.org/plagiarism-checking) should help. — nice_dev, May 08 '19 at 19:08
There are some examples [here](https://www.quora.com/Are-there-any-tools-to-check-how-similar-two-source-codes-are). It really depends on your language — juvian, May 08 '19 at 19:13
The existing tools I've used for this have been based on the [Rabin–Karp algorithm.](https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm) — erickson, May 08 '19 at 21:08

score 1 · Accepted Answer · answered May 08 '19 at 19:12

The string comparison algorithm is not the main focus imo. Dice or Levenshtein or q-grams shouldn't matter (although I am no expert).
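For reference, here is a minimal sketch of Dice's coefficient over character bigrams, assuming plain Python strings as input. The names `bigrams` and `dice_coefficient` are mine, not from any library. On raw source text, a simple variable rename already lowers the score, which is why the preprocessing matters more than the metric.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset (Counter) of adjacent character pairs in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_coefficient(a, b):
    """Dice similarity over character bigrams, a value between 0 and 1."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are too short to have bigrams; fall back to equality.
        return 1.0 if a == b else 0.0
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((grams_a & grams_b).values())
    return 2 * overlap / total
```

For example, `dice_coefficient("night", "nacht")` gives 0.25: the two words share one bigram (`ht`) out of eight in total.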

The main thing is to convert your code into a sequence of tokens. Label the first distinct token as 1, the second as 2, etc., reusing the same label whenever a token repeats. Then compare the two label sequences. This will give you an exact match if all a person did was change some variable names.

To be more sophisticated, you can give fixed, unique labels to tokens that match a keyword (if, with, for, do, etc.; most languages have similar keywords). This can avoid false positives.

Example:

sample1:

name = 'fred'
print(name)

sample2:

my_name = 'harry'
print(my_name)

sample1 tokens: name, =, ', fred, ', print, (, name, )

sample1 processed tokens: 1, 2, 3, 4, 3, 5, 6, 1, 7

sample2 tokens: my_name, =, ', harry, ', print, (, my_name, )

sample2 processed tokens: 1, 2, 3, 4, 3, 5, 6, 1, 7

Now compare the processed tokens from sample1 and sample2: the two sequences are identical.
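The steps above can be sketched in Python roughly as follows. The regex tokenizer is a crude assumption chosen to reproduce the token split in the example; a real system would use a proper lexer for the target language. `tokenize`, `normalize`, and the `keywords` parameter are hypothetical names. Note that this sketch gives `(` and `)` different labels, since they are different tokens.

```python
import re

# Assumption: a token is a run of word characters or a single
# non-space symbol. Good enough for the example, not a real lexer.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(source):
    """Split source text into a flat list of token strings."""
    return TOKEN_RE.findall(source)


def normalize(tokens, keywords=frozenset()):
    """Replace each distinct token with the order of its first appearance.

    Tokens listed in `keywords` are kept unchanged, so that `for` and
    `while`, for example, never collapse onto the same label.
    """
    labels = {}
    result = []
    for token in tokens:
        if token in keywords:
            result.append(token)
            continue
        if token not in labels:
            labels[token] = len(labels) + 1
        result.append(labels[token])
    return result


sample1 = "name = 'fred'\nprint(name)"
sample2 = "my_name = 'harry'\nprint(my_name)"

seq1 = normalize(tokenize(sample1))
seq2 = normalize(tokenize(sample2))
print(seq1 == seq2)  # the renamed variable and string no longer matter
```

The resulting label sequences can then be passed to whichever similarity measure you prefer (Dice, Levenshtein, q-grams) instead of the raw source text.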
