Fast String Searching Algorithm for large strings

Question

I'm trying to implement a plagiarism detection software using pattern matching algorithms. I came across the KMP Algorithm Here and tried out the c# implementation. I can see that it's not as fast for actual documents (not strings, I uploaded two pdf documents using iText and got the implementation to check for plagiarism in these two papers. About 50 pages).

It's really slow and I have no idea how to go about this. I've looked at Boyer Moore and Rabin Karp as well.

What I am currently doing is taking each sentence in the document (split on '.') and scanning through the whole reference document (2nd document) for a match. Then taking the next sentence and so on... I am fully aware that this could be very expensive. But I have no idea how else to implement string (pattern) matching without using this approach. It's for my final year project and I was given a topic so I HAVE to use string matching. (Not allowed to do Citation based plagiarism, Semantics or Vector Space.)

The larger the text and pattern gets, the slower the algorithm gets (extremely slow, not even reasonably slow). Is there another way to go about this that I don't know? Or are there faster algorithms for me to use with this my approach?

EDIT

My code below:`

public class MatchChecker
{
    public void KMPSearch(string pattern, string text, int page)
    {
        if (pattern.Trim() != "")
        {
            int M = pattern.Length;
            int N = text.Length;

            // create lps[] that will hold the longest
            // prefix suffix values for pattern
            int[] lps = new int[M];
            int j = 0;  // index for pat[]

            // Preprocess the pattern (calculate lps[]
            // array)
            computeLPSArray(pattern, M, lps);

            int i = 0;  //index for text[]
            while (i < N)
            {
                if (pattern[j] == text[i])
                {
                    j++;
                    i++;
                }
                if (j == M)
                {
                    Console.WriteLine("Found pattern " + pattern + " at page " + page);
                    j = lps[j - 1];
                }
                //mismatch after j matches 
                else if (i < N && pattern[j] != text[i])
                {
                    //Do not match lps[0..lps[j-1]] characters,
                    //they will match anyway
                    if (j != 0)
                        j = lps[j - 1];
                    else
                        i = i + 1;
                }
            }
        }
    }

    private void computeLPSArray(string pattern, int M, int[] lps)
    {

        //length of the previous longest prefix suffix
        int length = 0;
        int i = 1;
        lps[0] = 0;     //lps[0]is always 0

        //the loop calculates lps[i] for i = 1 to M - 1
        while (i < M)
        {
            if (pattern[i] == pattern[length])
            {
                length++;
                lps[i] = length;
                i++;
            }
            else  // (pat[i] != pat[len])
            {
                // This is tricky. Consider the example.
                // AAACAAAA and i = 7. The idea is similar 
                // to search step.
                if (length != 0)
                {
                    length = lps[length - 1];

                    // Also, note that we do not increment
                    // i here
                }
                else  // if (len == 0)
                {
                    lps[i] = length;
                    i++;
                }
            }
        }
    }

    public string ReadDocPDF()
    {
        List<string> pages = new List<string>();
        PdfReader reader2 = new PdfReader(@"C:\Users\obn\Desktop\project\chapter1.pdf");
        string strText = string.Empty;

        for (int page = 1; page <= reader2.NumberOfPages; page++)
        {
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            PdfReader reader = new PdfReader(@"C:\Users\obn\Desktop\project\chapter1.pdf");
            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);

            s = Regex.Replace(Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s))).Replace(",", ""), "[0-9]", "").ToLower();
            pages.Add(s);
            strText = strText + s;
            reader.Close();
        }
        return strText;
    }

    public void CheckForPlag()
    {
        string document = ReadDocPDF().Trim();
        string[] sentences = document.Split(new string[] { "\n", "\t\n\r", ". ", "." }, StringSplitOptions.RemoveEmptyEntries);

        foreach(string sentence in sentences) {
            PdfReader reader2 = new PdfReader(@"C:\Users\obn\Documents\visual studio 2015\Projects\PlagDetector\PlagDetector\bin\Debug\test3.pdf");
            for (int page = 1; page <= reader2.NumberOfPages; page++)
            {
                ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
                PdfReader reader = new PdfReader(@"C:\Users\obn\Documents\visual studio 2015\Projects\PlagDetector\PlagDetector\bin\Debug\test3.pdf");
                String s = PdfTextExtractor.GetTextFromPage(reader, page, its);

                s = Regex.Replace(Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s))).Trim().Replace(".","").Replace(",","").Replace("\n", ""), "[0-9]", "").ToLower();
                KMPSearch(sentence, s, page);
                reader.Close();
            }
        }


    }
}`

1. When asking a question like this, you should include code. 2. If your code is complete and works, but could be improved, it should really be on Code Review. This site is primarily for fixing broken code. — Carcigenicate, May 23 '17 at 22:11
Thanks. You already have an accepted answer though. It's not really necessary now. It was more of a general suggestion for when you ask questions. — Carcigenicate, May 25 '17 at 02:43
each question should only be 1 question. If you have a new problem, post it as a new question. — Carcigenicate, May 25 '17 at 12:26
You don't give an actual criterion for detecting plagiarism, so it's hard to tell exactly what a solution would do. But in any case, a generalised suffix tree can be built from 2 texts in linear time, and it can be queried to find the longest substring common to both texts, again in linear time. I think this could be useful for plagiarism detection, although it will be fooled by small changes. — j_random_hacker, May 25 '17 at 16:00

score 0 · Accepted Answer · answered May 24 '17 at 01:36

0

Your algorithm is doing lot of repeated searches purely a brute force. Some of the problems features can be considered to optimize it. How do you define 'plagiarism'? content-1 and content-2 are nearly similar. Let us say >80% are same. i.e content-1 is taken 20% is changed to produce content-2.

Now, Let us try to solve: what will be cost (no.of changes) to convert content-1 to content-2? This is a well know problem in DP(dynamic programming world) as EDIT Distance problem. The standard problem talks about strings distance, but you can easily modify it for words instead of chars.

Now, the above problem will give you least no.of changes for conversion of content-1 to content-2. With the total length of content-1, we can easily calculate the % of changes to go to content-2 from content-1. If it below a fixed threshold (say 20%) then declare the plagiarism.

answered May 24 '17 at 01:36

arunk2

2,246
3
23
35

1

Okay, so how do I go about this if "part" of the document is plagiarised and not the whole document. Because from what I see, I'm to take the entire document (A) and document (B) and check for the edit distance once? Also, how can I know the exact sentence that was plagiarised with this method if I want to give a report? Because all I can see that Edit distance returns are numbers. @Arun – OBN May 25 '17 at 02:40
You can maintain the matching words and their position in some auxillary data-structure like a hash-table when you are calculating the EDIT distance. In the recursive implementation, the part 'if (str1[m-1] == str2[n-1]) return editDist(str1, str2, m-1, n-1);' is the place you may want to store these matches along with positions in both doc1 and doc2. – arunk2 May 25 '17 at 04:06

Fast String Searching Algorithm for large strings

EDIT

1 Answers1