
I am working on an implementation of a source code plagiarism detection algorithm (the winnowing algorithm) and have run into a problem I need help with.

Example: I have a string

String test = "blahello,,,,/blatestbla7234///§\"§$%\"%$\n\n23344)§()(§$blablayeahbla";

and transform this string, keeping only the letters, into

test = "blahelloblatestblablablayeahbla";

From this string I build k-grams, for example 5-grams:

blahe lahel ahell hello ellob llobl ... ahbla

I save the k-grams in a list of strings, but I would also like to save each k-gram's start and end position in the original text, so that in the end I can map every k-gram back to its position in the original text.
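The steps described so far can be sketched roughly as follows. This is only an illustration: the class name `PlainKGrams` and its two methods are hypothetical, and it assumes that the transformation keeps exactly the characters for which `Character.isLetter` returns true.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the question's setup; not the asker's actual code.
final class PlainKGrams {

    // Keep only letters (assumption: this is the transformation described in the question).
    static String lettersOnly(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Slide a window of length k over the cleaned text and collect every substring.
    static List<String> kgrams(String cleaned, int k) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + k <= cleaned.length(); i++) {
            result.add(cleaned.substring(i, i + k));
        }
        return result;
    }
}
```

The loop index `i` here is a position in the cleaned string only. Once the non-letters have been dropped, nothing in this sketch remembers where each letter came from in the original text.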

EDIT:

So my question is: how can I get the start and end position of each k-gram in the original text? Does anyone have an idea? Thanks in advance.

vizero
  • Do you mean something like this? blahe.begin = 0, blahe.end = 4, lahel.begin = 1, lahel.end = 5, ... ? – wimdetr May 09 '17 at 20:26
  • Define a class NGram with whatever properties that you need (e.g., n, value, beginIndex, endIndex, etc.). Then your n-grams are instances of NGram rather than instances of String and you can carry around whatever additional meta data that you might find useful. – Rob May 09 '17 at 21:34
  • Oh sorry, my question was misleading. I have edited it. I can save it in a class, but how can I get the start and end position of a k-gram? I transformed the original text and a lot of characters were removed. For example, for the k-gram ellob I want to get start position 4 and end position 13. – vizero May 10 '17 at 08:53
  • @vizero Did you mean end position 8? – wimdetr May 10 '17 at 09:39
  • No, I mean 13. I want to get the position in the original, unmodified string. 8 would be right for the modified string. – vizero May 10 '17 at 09:44

1 Answer


If you want the positions from the original string, you can't remove the non-letters first, or the information is lost. You'll either need to find the kgrams in the original string directly (more CPU time) or store the original position of each letter along with the modified string (more memory space).

Here's an implementation of the latter:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KGram {

    public final String str;
    public final int start;
    public final int end;

    public KGram(String str, int start, int end) {
        this.str = str;
        this.start = start;
        this.end = end;
    }

    @Override
    public String toString() {
        return "KGram[\"" + str + "\":" + start + "," + end + "]";
    }

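    // Builds all k-grams of the given size from the letters of the input.
    // start and end are inclusive indexes into the original input string.
    public static List<KGram> extractFrom(String input, int size) {
        char[] chars = new char[input.length()];  // letters only, in order
        int[] indexes = new int[input.length()];  // original index of each kept letter
        int len = 0;                              // number of letters kept so far

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isLetter(c)) continue; // skip digits, punctuation, whitespace

            chars[len] = c;
            indexes[len] = i;
            len++;
        }

        List<KGram> kgrams = new ArrayList<>();
        // i is the first and j the last letter of the current window (both inclusive).
        for (int i = 0, j = size - 1; j < len; i++, j++) {
            String str = new String(Arrays.copyOfRange(chars, i, j + 1));
            kgrams.add(new KGram(str, indexes[i], indexes[j]));
        }
        return kgrams;
    }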
}

Example:

String test = "blahello,,,,/blatestbla7234///§\"§$%\"%$\n\n23344)§()(§$blablayeahbla";
List<KGram> kgrams = KGram.extractFrom(test, 5);

System.out.println(kgrams.get(4));  // prints KGram["ellob":4,13]
System.out.println(kgrams.get(26)); // prints KGram["ahbla":60,64]
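For comparison, the first option mentioned above (finding the k-grams in the original string directly, without the extra index array) might look roughly like this. It is a sketch only, not part of the implementation above: the names `DirectKGrams`, `Span` and `extract` are hypothetical, and it assumes the same `Character.isLetter` filter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; independent of the KGram class above.
final class DirectKGrams {

    // One k-gram plus its inclusive start and end index in the original text.
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static List<Span> extract(String input, int k) {
        List<Span> spans = new ArrayList<>();
        if (k <= 0) {
            return spans;
        }
        for (int start = 0; start < input.length(); start++) {
            // A k-gram can only begin at a letter.
            if (!Character.isLetter(input.charAt(start))) {
                continue;
            }
            StringBuilder sb = new StringBuilder(k);
            int end = -1;
            // Rescan forward from start, skipping non-letters, until k letters are collected.
            for (int j = start; j < input.length() && sb.length() < k; j++) {
                char c = input.charAt(j);
                if (Character.isLetter(c)) {
                    sb.append(c);
                    end = j;
                }
            }
            // Fewer than k letters remain, so no later start can succeed either.
            if (sb.length() < k) {
                break;
            }
            spans.add(new Span(sb.toString(), start, end));
        }
        return spans;
    }
}
```

This variant rescans the text for every start position, which is where the extra CPU time comes from. In exchange it needs no per-letter index array.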
Sean Van Gorder