How to verify whether source code was copied from the web

Question

I am building a web tool, a plagiarism detector, that checks whether submitted content was taken from the web or is the submitter's own work.

My idea is to generate a checksum for each submission and use it as a key to compare against other entries. However, if someone makes small changes, such as adding or removing comments or renaming variables and functions, the checksum will be different, so this approach won't work.
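To illustrate the problem, here is a minimal Python sketch. The two snippets and the choice of SHA-256 are assumptions made up for the example, not part of any real submission:

```python
import hashlib


def checksum(source: str) -> str:
    """Return the SHA-256 hex digest of the source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


# Hypothetical submissions: the second only renames the parameters.
original = "def add(a, b):\n    return a + b\n"
renamed = "def add(x, y):\n    return x + y\n"

# Identical text always hashes to the same value...
same = checksum(original) == checksum(original)
# ...but a trivial rename produces a completely different digest.
differs = checksum(original) != checksum(renamed)
```

Both `same` and `differs` come out true, so a checksum only catches byte-for-byte copies.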

Any suggestions for a better way?

What do you know about the submitted data? You can't scan the entire web and search for the content ... — Ionut Flavius Pogacian, Aug 20 '12 at 05:58
Yes, that's absolutely fine, but I can search the web for some keywords from the problem statement, look for solutions (source code), and then compare against them to some extent. — rspr, Aug 20 '12 at 06:04

Craig Ringer · Accepted Answer · 2012-08-20T06:57:07.877

Plagiarism detection is a special case of similarity detection. This is a big field of study that's almost as old as computer science itself. There is a lot of published research, and there just isn't a single simple answer.

See, e.g., a Google Scholar search for "code similarity plagiarism" or "plagiarism detection". Regular Google searches for things like "source code similarity detection algorithm" can also be useful.

There are plenty of existing tools in the space, too, so I'm surprised you're trying to write your own.

As you've noted, a checksum won't do the job unless the code is perfectly identical. Techniques that can help include:

Building word-frequency histograms and comparing them
Extracting comment text and looking for copied comments using text-substring matching
Extracting variable, class and method names and looking for other code that uses the same names. You have to do a lot of correction for "obvious" names that everyone will choose, and for names that are dictated by the problem, like implementing a particular interface or API. Private class member variables and the local variables inside a function or method are the most useful to compare. You will need the help of a compiler, or at least a syntax parser for the language, to extract these.
Looking for differences in indenting style. Did the user use all-spaces indenting, except for this one function that's indented with tabs?
Comparing parse trees or token streams to strip out the effects of formatting. You'd usually have to compare individual functions, etc, not just the code as a whole.
... and lots more
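
As one example of the token-stream idea from the list above, here is a rough sketch. It assumes the submissions are Python so that the standard `tokenize` module can be used; other languages would need their own lexer. The placeholder names `ID`, `NUM` and `STR` and the use of `difflib` for the comparison are choices made for this illustration, not an established method:

```python
import difflib
import io
import keyword
import token
import tokenize


def normalized_tokens(source: str) -> list:
    """Tokenize Python source, dropping comments and masking names/literals."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Comments, blank-line newlines and the end marker carry no logic.
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            # Keep block structure, but not the exact whitespace used.
            out.append(token.tok_name[tok.type])
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")  # every identifier looks the same
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)  # keywords and operators stay as-is
    return out


def token_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized token streams."""
    matcher = difflib.SequenceMatcher(
        None, normalized_tokens(a), normalized_tokens(b), autojunk=False
    )
    return matcher.ratio()
```

With this normalization, a copy that renames `add(a, b)` to `plus(x, y)`, swaps spaces for tabs and edits the comments still produces the same token stream as the original. As noted in the list, it is usually more useful to run this per function than on whole files.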

What you'll have to do is produce a report that weighs all these factors and others and presents them to a human so the human can make a decision. Your tool should explain why it thinks two results are similar, not just that they are similar.
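
A report of that kind might be assembled roughly as follows. This is only a sketch: the `Evidence` class, its fields, the weights and the weighted-average formula are all hypothetical choices made for illustration, and a real tool would need to tune them:

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    name: str      # which check produced this score
    score: float   # 0.0 (no similarity) to 1.0 (identical)
    weight: float  # hypothetical relative importance of the check
    detail: str    # human-readable explanation for the reviewer


def build_report(evidence):
    """Return (overall_score, report_text) for a list of Evidence items."""
    if not evidence:
        raise ValueError("no evidence to report")
    total_weight = sum(e.weight for e in evidence)
    overall = sum(e.score * e.weight for e in evidence) / total_weight
    lines = [f"Overall similarity: {overall:.2f}"]
    # List the strongest contributors first so the reviewer sees why.
    for e in sorted(evidence, key=lambda e: e.score * e.weight, reverse=True):
        lines.append(f"- {e.name}: {e.score:.2f} (weight {e.weight}): {e.detail}")
    return overall, "\n".join(lines)


overall, report = build_report([
    Evidence("token stream", 1.0, 3.0, "function bodies match after renaming"),
    Evidence("comments", 0.0, 1.0, "no shared comment text"),
])
```

The point is that each score travels with its explanation, so the final decision stays with the human reviewer.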

@IonutFlaviusPogacian I've apparently *always* misread that without realising. Thanks for pointing it out. — Craig Ringer, Aug 20 '12 at 06:56

score 0 · Answer 2 · answered Aug 20 '12 at 06:49

Here is how I would approach this; custom enhancements can be added later:

Remove everything that is not a letter or number;

Use explode() with a space character as the delimiter to split the text into words; now you know how many words the article has;

Now you must find out how many times each word appears in the article, increasing that word's counter each time the word is found in the text;

Store this into an array, like:

$words['wordX']++;

Do the same with the second article that you want to compare against;

Now compare them. You know the original data, so some conclusions can be made at this step;

Using the capital letters, like J from John or F from Feudalism, you can also draw some conclusions;

From here you may know whether the two articles are about the same thing, and this could be the real step #1.
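
The word-counting steps above can be sketched as follows. The sketch is in Python rather than PHP, with `re.sub` and `split()` standing in for the cleanup and explode() steps. Lowercasing the text and comparing the two histograms with cosine similarity are assumptions added for the example, since the answer does not say how the counts should be compared:

```python
import math
import re
from collections import Counter


def word_histogram(text: str) -> Counter:
    """Count words after stripping everything but letters and digits."""
    # Replace non-alphanumeric runs with a space so words stay separated.
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", text).lower()
    return Counter(cleaned.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A high score only suggests that the two texts use the same vocabulary, which fits the idea that this is a first filtering step before a closer comparison.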

Now you have to somehow parse both articles word by word at the same time and see the differences between them.

A student can add their own "original" sentence after each sentence or paragraph taken from the original article.

Make sure that if you advance too far in one of the articles, you somehow keep the parsing balanced by advancing through the second article until the two are back in step.

I see 2 for loops, maybe 3, or, instead of the third, a function that tries to keep the parsing process balanced.

Also, you have to use explode() to check sentence by sentence, and word by word within each sentence, to find the similarity.
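
One possible reading of the sentence-by-sentence comparison, sketched in Python. The naive sentence splitter, the `threshold` value of 0.8 and the use of `difflib.SequenceMatcher` for the word-level comparison are assumptions for this example, not something the answer specifies:

```python
import difflib
import re


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def copied_sentences(original: str, submission: str, threshold: float = 0.8):
    """Return (original_sentence, submitted_sentence, ratio) for close matches."""
    matches = []
    submitted = split_sentences(submission)
    for src in split_sentences(original):          # loop 1: original sentences
        src_words = src.lower().split()
        best, best_ratio = None, 0.0
        for cand in submitted:                     # loop 2: submitted sentences
            ratio = difflib.SequenceMatcher(
                None, src_words, cand.lower().split()
            ).ratio()
            if ratio > best_ratio:
                best, best_ratio = cand, ratio
        if best is not None and best_ratio >= threshold:
            matches.append((src, best, best_ratio))
    return matches
```

Because every original sentence is compared against every submitted sentence, extra sentences inserted by the student between copied ones do not throw the comparison out of step.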

I am sure you get the idea, but I say again: you can't parse the entire WWW.

Thanks. Actually, let me elaborate a bit more: I am posting an algorithmic challenge on the college portal. Programmers are free to discuss and post solutions, so I am trying to verify their submissions against solutions available on various forums and blogs. — rspr, Aug 20 '12 at 06:57
The best way to check for plagiarism is to check word by word and sentence by sentence, and to use the original article as a reference point in whatever parsing process your students develop. — Ionut Flavius Pogacian, Aug 20 '12 at 07:02
