6

I am planning to build a plagiarism detector as my final-year Computer Science Engineering project, and I would like your suggestions on how to go about it.

I would appreciate it if you could suggest which areas of CS I need to focus on, and which language would be most appropriate for the implementation.

deovrat singh

4 Answers

10

The language is nearly irrelevant. Another question exists that discusses this in more detail. Basically, the method suggested there is to use Google: extract parts of the target text and search for them on Google.
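
As a rough illustration of "extract parts of the target text", here is a minimal sketch that cuts a text into fixed-size, quoted phrases that could be pasted into a search engine for an exact-match query. The function name `search_phrases` and the phrase length are my own assumptions, and no search API is actually called here.

```python
import re


def search_phrases(text, words_per_phrase=8, step=8):
    """Split text into quoted phrases suitable for an exact-match web search.

    Hypothetical helper: the phrase length and step are arbitrary defaults.
    """
    words = re.findall(r"\w+(?:'\w+)?", text)
    phrases = []
    for start in range(0, len(words), step):
        chunk = words[start:start + words_per_phrase]
        # Skip a short trailing chunk; very short phrases match too many pages.
        if len(chunk) == words_per_phrase:
            phrases.append('"' + " ".join(chunk) + '"')
    return phrases


sample = ("The quick brown fox jumps over the lazy dog while the cat "
          "sleeps in the warm sun all day")
for phrase in search_phrases(sample, words_per_phrase=6, step=6):
    print(phrase)
```

Using a `step` smaller than `words_per_phrase` would give overlapping phrases, which may catch copied passages that do not line up with the chunk boundaries, at the cost of more searches.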

Sampson
5

I am building a plagiarism checker in Python as a hobby project. These are the steps I follow:

  1. Tokenize the document.

  2. Remove all the stop words using the NLTK library.

  3. Use the GenSim library to find the most relevant words, line by line. This can be done by building an LDA or LSA model of the document.

  4. Use the Google Search API to search for those words.

Note: you could instead use the Google API to search for the whole document at once. That works for small amounts of data. However, when building a plagiarism checker for websites and web-scraped data, you will need to apply the NLTK processing first.

The Google Search API will return the top articles containing the words that the GenSim LDA or LSA functions produced.

Hope this helps.
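
The first three steps can be sketched without any third-party library, just to show the shape of the pipeline. This is only an assumption-laden stand-in: the regex tokenizer replaces NLTK's tokenizer, `STOP_WORDS` is a tiny hypothetical list (NLTK's English list is much longer), and plain word frequency replaces the LDA/LSA relevance from GenSim. The function names are mine, and the search call itself is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (NLTK's list is much longer).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
              "of", "on", "or", "that", "the", "to", "with"}


def tokenize(text):
    """Step 1: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Step 2: drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def top_keywords(line, k=3):
    """Step 3 stand-in: the k most frequent non-stop words in a line."""
    counts = Counter(remove_stop_words(tokenize(line)))
    return [word for word, _ in counts.most_common(k)]


def build_queries(document, k=3):
    """Produce one keyword query per non-empty line, ready for a search API."""
    queries = []
    for line in document.splitlines():
        keywords = top_keywords(line, k)
        if keywords:
            queries.append(" ".join(keywords))
    return queries


text = ("Plagiarism detection compares plagiarism candidates with sources.\n"
        "The tokens of the document are compared.")
for query in build_queries(text):
    print(query)
```

Each resulting query string would then be passed to whatever search API you use in step 4.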

Sumukh Bhandarkar
0

Here is some simple code that computes the similarity percentage between two files:

import numpy as np


def levenshtein(seq1, seq2):
    """Return the Levenshtein (edit) distance between two sequences."""
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] holds the distance between seq1[:x] and seq2[:y].
    matrix = np.zeros((size_x, size_y), dtype=int)
    # Against an empty prefix, the distance is simply the prefix length.
    matrix[:, 0] = np.arange(size_x)
    matrix[0, :] = np.arange(size_y)

    for x in range(1, size_x):
        for y in range(1, size_y):
            # A substitution is free when the two characters already match.
            cost = 0 if seq1[x - 1] == seq2[y - 1] else 1
            matrix[x, y] = min(
                matrix[x - 1, y] + 1,         # deletion
                matrix[x - 1, y - 1] + cost,  # substitution or match
                matrix[x, y - 1] + 1,         # insertion
            )
    return int(matrix[size_x - 1, size_y - 1])

def read_compact(path):
    """Read a file and drop spaces and newlines before comparing."""
    with open(path, 'r') as f:
        return f.read().replace('\n', '').replace(' ', '')


str1 = read_compact('original.txt')
str2 = read_compact('target.txt')
# Normalise by the longer text; `or 1` avoids dividing by zero for two empty files.
length = max(len(str1), len(str2)) or 1
print(100 - round((levenshtein(str1, str2) / length) * 100, 2), '% Similarity')

Create two files, "original.txt" and "target.txt", containing the text to compare, in the same directory as the script.
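
One thing to watch with whole files: the full distance table needs memory proportional to the product of the two lengths. A possible variant, sketched here in plain Python under my own naming (`levenshtein_two_rows` and `similarity_percent` are not part of the code above), keeps only two rows of the table and applies the same percentage formula to in-memory strings.

```python
def levenshtein_two_rows(a, b):
    """Edit distance between a and b, keeping only two rows of the table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Similarity as a percentage of the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return round(100 - levenshtein_two_rows(a, b) / longest * 100, 2)


print(similarity_percent("kitten", "sitting"))
```

For "kitten" and "sitting" the edit distance is 3 and the longer string has 7 characters, so the score is 100 - 3/7 * 100, about 57.14.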

-4

I would suggest Python, because it makes developing a program like this easy. I am also working on a plagiarism detector project. If you are building a detector for source code, I suggest tokenizing the input first; it is complicated, but that is the way to do it. If you are building a detector for text files, use the cosine similarity method, the LCS (longest common subsequence) method, or simply consider position.
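
To make the two named methods concrete, here is a minimal sketch of cosine similarity over word counts and of the LCS length between two sequences. The function names and the simple regex tokenizer are assumptions for illustration; a real detector would likely need better tokenization and weighting.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphanumeric words in a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two word-count vectors (0.0 to 1.0)."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_a * norm_b)


def lcs_length(seq_a, seq_b):
    """Length of the longest common subsequence, using two table rows."""
    previous = [0] * (len(seq_b) + 1)
    for item_a in seq_a:
        current = [0]
        for j, item_b in enumerate(seq_b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


original = "the cat sat on the mat"
suspect = "the cat lay on the mat"
print(cosine_similarity(original, suspect))
print(lcs_length(original.split(), suspect.split()))
```

Cosine similarity ignores word order, while LCS is sensitive to it, so the two can be combined: the first as a cheap filter, the second to confirm that matching words appear in the same sequence.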

aNn