How to develop a Plagiarism detector?

Question

I am planning to build a plagiarism detector as my final-year Computer Science Engineering project, and I would like your suggestions on how to go about it.

I would appreciate it if you could suggest which fields of CS I need to focus on, and which language would be the most appropriate for the implementation.

Did you steal this from http://stackoverflow.com/questions/1085048/how-would-you-code-an-anti-plagiarism-site? — skaffman, Jul 28 '09 at 11:11
Did you steal this from stackoverflow.com/questions/1085048/…? — MusiGenesis, Jul 28 '09 at 11:16
Did you steal this from stackoverflow.com/questions/1085048/ ...? — Mark Rushakoff, Jul 28 '09 at 11:24
@skaffman, @MusiGenesis, @Mark - why "steal" ??? Sounds like "Microsoft have stolen windows idea from Apple" — abatishchev, Oct 21 '10 at 07:46

score 10 · Accepted Answer · edited May 23 '17 at 11:44

The language is nearly irrelevant. Another question exists that discusses this in a bit more detail. Basically, the method suggested there is to use Google: extract parts of the target text and search for them on Google.
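As a rough illustration of the "extract parts" step, the sketch below cuts a text into fixed-size word windows and wraps each one in quotes so a search engine treats it as an exact phrase. The function name `search_phrases` and the window size are assumptions made for this example, not something the linked question prescribes, and the search call itself is left out because it depends on which search API and quota you have.

```python
import re


def search_phrases(text, words_per_phrase=8, step=8):
    """Split text into word windows and quote each one for exact-match search."""
    # Drop double quotes so they cannot break the quoted query string.
    words = re.findall(r"\S+", text.replace('"', " "))
    phrases = []
    for start in range(0, len(words), step):
        chunk = words[start:start + words_per_phrase]
        # Very short trailing fragments match too many pages to be useful.
        if len(chunk) >= words_per_phrase // 2:
            phrases.append('"' + " ".join(chunk) + '"')
    return phrases


if __name__ == "__main__":
    sample = "It was the best of times, it was the worst of times, it was the age of wisdom"
    for query in search_phrases(sample, words_per_phrase=6, step=6):
        print(query)
```

A smaller `step` than `words_per_phrase` gives overlapping windows, which catches copied passages that do not line up with the window boundaries, at the cost of more queries.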

answered Jul 28 '09 at 11:14

Sampson

score 5 · Answer 2 · answered Aug 03 '17 at 18:03

I am making a plagiarism checker using Python as a hobby project. The following steps are to be followed:

1. Tokenize the document.
2. Remove all the stop words using the NLTK library.
3. Use the Gensim library to find the most relevant words, line by line. This can be done by building an LDA or LSA model of the document.
4. Use the Google Search API to search for those words.
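The first three steps can be sketched roughly as follows. To keep the example runnable without downloads, it does not call NLTK or Gensim: the small `STOP_WORDS` set stands in for NLTK's stop-word list, and a plain word-frequency ranking stands in for the LDA or LSA model. Both are simplifying assumptions for illustration, and `keywords_per_line` is a hypothetical helper name. The resulting strings are what would be sent to the search API in step 4.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of",
              "on", "that", "the", "to", "was", "with"}


def tokenize(line):
    """Lower-case a line and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", line.lower())


def keywords_per_line(document, top_n=3):
    """Return one search string per non-empty line, built from its top words."""
    results = []
    for line in document.splitlines():
        tokens = [t for t in tokenize(line) if t not in STOP_WORDS]
        if not tokens:
            continue  # blank line, or a line made only of stop words
        # Frequency ranking is a stand-in here for an LDA/LSA relevance score.
        ranked = [word for word, _ in Counter(tokens).most_common(top_n)]
        results.append(" ".join(ranked))
    return results


if __name__ == "__main__":
    doc = "The cat sat on the mat with the cat\n\nIt is a dog"
    for query in keywords_per_line(doc):
        print(query)
```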

Note: you could instead pass the whole document to the Google API in a single search. That works when you are dealing with a small amount of data. However, when building a plagiarism checker for sites and web-scraped data, the text needs to be reduced with the NLTK steps above first.

The Google Search API will return the top articles that contain the same words that the LDA or LSA functions of the Gensim library produced.

Hope it helped.

score 0 · Answer 3 · answered Aug 16 '20 at 17:32

Here is some simple code that computes the similarity percentage between two files:

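import numpy as np


def levenshtein(seq1, seq2):
    """Return the Levenshtein (edit) distance between two sequences."""
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] holds the distance between seq1[:x] and seq2[:y].
    matrix = np.zeros((size_x, size_y), dtype=int)
    # Reaching or leaving the empty prefix costs one edit per character.
    matrix[:, 0] = np.arange(size_x)
    matrix[0, :] = np.arange(size_y)

    for x in range(1, size_x):
        for y in range(1, size_y):
            # A substitution is free when the characters already match.
            cost = 0 if seq1[x - 1] == seq2[y - 1] else 1
            matrix[x, y] = min(
                matrix[x - 1, y] + 1,         # deletion
                matrix[x - 1, y - 1] + cost,  # match or substitution
                matrix[x, y - 1] + 1,         # insertion
            )
    return int(matrix[size_x - 1, size_y - 1])


def read_normalized(path):
    """Read a file and strip newlines and spaces before comparing."""
    with open(path, 'r') as file:
        return file.read().replace('\n', '').replace(' ', '')


str1 = read_normalized('original.txt')
str2 = read_normalized('target.txt')
# Normalize by the longer string so the result stays between 0 and 100.
length = max(len(str1), len(str2))
# Two empty files are treated as identical to avoid dividing by zero.
similarity = 100.0 if length == 0 else round(100 * (1 - levenshtein(str1, str2) / length), 2)
print(similarity, '% Similarity')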

Create two files, "original.txt" and "target.txt", with some content, in the same directory as the script.
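To see what the percentage means before creating the files, the same formula (100 minus the edit distance as a percentage of the longer string's length) can be tried on two short strings. This is a sketch rather than the code above: `edit_distance` and `similarity_percent` are hypothetical names, and it uses a plain-Python two-row variant of the Levenshtein algorithm so it runs without NumPy.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the table in memory."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Similarity as 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return round(100 * (1 - edit_distance(a, b) / longest), 2)


print(edit_distance("kitten", "sitting"))       # 3
print(similarity_percent("kitten", "sitting"))  # 57.14
```

Note that the full table grows with the product of the two lengths, so a character-level comparison of large files gets slow quickly. Comparing word tokens instead of characters is one way to shrink the problem.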

score -4 · Answer 4 · answered Oct 21 '10 at 07:41

You'd better try Python, because it is easy to develop a program with it. I'm also doing a project on a plagiarism detector. I suggest you tokenize the string first. It is actually complicated, but this is the way to go if you are developing a detector for source code. If you are developing a plagiarism detector for text files, use the cosine similarity method, the LCS method, or simply compare word positions.
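For the text-file case, the two named measures could look roughly like this. The sketch makes a few assumptions the answer does not spell out: "LCS" is read as longest common subsequence, tokens are whitespace-separated lower-cased words, and cosine similarity is computed on raw word counts. The function names are chosen for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_a * norm_b)


def lcs_length(tokens_a, tokens_b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(tokens_b) + 1)
    for tok_a in tokens_a:
        current = [0]
        for j, tok_b in enumerate(tokens_b, start=1):
            if tok_a == tok_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the dog sat on the mat"
    print(cosine_similarity(a, b))
    print(lcs_length(a.split(), b.split()))
```

Cosine similarity ignores word order, so it flags texts that reuse the same vocabulary. LCS respects order, so it is better at spotting passages copied with a few words inserted or removed.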
