0

I wanted to compute the LCS (longest common subsequence) of two files from their binary content, so I took the usual LCS source code and used the GenStr command to convert each file's bytes to a String first. The problem is that I got a memory out-of-bounds error, because comparing Strings has a size limit. I am therefore planning to store the bytes in arrays and compare those instead. Is it possible to use the LCS algorithm to compare two byte arrays?

EDIT:

public static byte[] Compare(byte[] x, byte[] y) {

    int i, j;
    final int x_length = x.length;
    final int y_length = y.length;
    int n = 2048;
    int m = 2048;


    // D[i][j] = direction, L[i][j] = Length of LCS 
    int[][] D = new int[n + 1][m + 1];
    byte[][] L = new byte[n + 1][m + 1]; // { 1, 2, 3 }

    // D[i][0] = 0 for 0<=i<=n 
    // D[0][j] = 0 for  0<=j<=m 
    for (i = 1; i <= n; i++) {
        for (j = 1; j <= m; j++) {
            if (x[i - 1] == y[j - 1]) {
                D[i][j] = D[i - 1][j - 1] + 1;
                L[i][j] = 1;
            } else if (D[i - 1][j] >= D[i][j - 1]) {
                D[i][j] = D[i - 1][j];
                L[i][j] = 2;
            } else {
                D[i][j] = D[i][j - 1];
                L[i][j] = 3;
            }
        }
    }

    // Backtrack 
    ByteArrayOutputStream lcs = new ByteArrayOutputStream();
    i = n;  
    j = m;
    while (i != 0 && j != 0) {
        switch (L[i][j]) {
            case 1:   // diagonal 
                lcs.write(x[i - 1]); // Unreversed LCS
                --i;
                --j;
                break;
            case 2:  // up 
                --i;
                break;
            case 3:  // backward 
                --j;
                break;
        }
    }
    byte[] result = lcs.toByteArray();

    // Reverse:
    for (i = 0, j = result.length - 1; i < j; ++i, --j) {
        byte b = result[i];
        result[i] = result[j];
        result[j] = b;
    }
    return result;

    //While not end of file
    while(n < x_length && m < y_length){
        if(n+2048 < x.length){
            n = n+2048;
        } else {
            n = x.length;
        }

        if(m+2048 < y.length){
            m = m+2048;
        } else {
            m = y.length;
        }

    // D[i][j] = direction, L[i][j] = Length of LCS 
    int[][] D_new = new int[n + 1][m + 1];
    byte[][] L_new = new byte[n + 1][m + 1]; // { 1, 2, 3 }

    // D[i][0] = 0 for 0<=i<=n 
    // D[0][j] = 0 for  0<=j<=m 
    for (i = i+2048; i <= n; i++) {
        for (j = j+2048; j <= m; j++) {
            if (x[i - 1] == y[j - 1]) {
                D_new[i][j] = D_new[i - 1][j - 1] + 1;
                L_new[i][j] = 1;
            } else if (D_new[i - 1][j] >= D_new[i][j - 1]) {
                D_new[i][j] = D_new[i - 1][j];
                L_new[i][j] = 2;
            } else {
                D_new[i][j] = D_new[i][j - 1];
                L_new[i][j] = 3;
            }
        }
    }

    // Backtrack 
    ByteArrayOutputStream lcs_next = new ByteArrayOutputStream();
    i = n;  
    j = m;
    while (i != 0 && j != 0) {
        switch (L[i][j]) {
            case 1:   // diagonal 
                lcs_next.write(x[i - 1]); // Unreversed LCS
                --i;
                --j;
                break;
            case 2:  // up 
                --i;
                break;
            case 3:  // backward 
                --j;
                break;
        }
    }
    byte[] result_new = lcs_next.toByteArray();

    // Reverse:
    for (i = 0, j = result_new.length - 1; i < j; ++i, --j) {
        byte b = result_new[i];
        result_new[i] = result_new[j];
        result_new[j] = b;
    }
    return result_new;
    Arrays.fill(D_new, null);
    Arrays.fill(L_new, null);
    Arrays.fill(result_new, null);
    lcs_next.reset();
}
}

I tried, but haven't been able to check if this can be used or not, because of some errors.

Questions:

  1. How do I append the LCS produced at the first return (return result) to the one produced at the second (return result_new)?
  2. How do I clear the arrays so I can reuse them with different input? (Arrays.fill(D_new, null) and Arrays.fill(L_new, null) don't work.)

Thank you in advance

chiwangc
Anonymous

2 Answers

1

There's nothing to stop you using a byte array instead. This will use roughly half the memory of the equivalent String (whose chars are 16 bits each), but its maximum length will be the same: Integer.MAX_VALUE. If you're running out of RAM, but not hitting the length limit, then this might save you.

If these are coming from files, then that's what you should be doing anyway. You really shouldn't be reading them in as entire strings. Read them byte by byte.

But if the files are huge (more than 2 GB), the right way to do this is to process the files as you go rather than reading them in beforehand, and to use a file to store the LCS data that you're creating as well. The nice thing about the algorithm is that all the access is localised: you scan the input files sequentially (so you gain nothing from reading them in advance), and you write the arrays fairly close to sequentially, considering only the previous and current rows when you calculate a new value (so you don't gain much by having them in RAM either).

Doing it like this will allow you to scale the files arbitrarily. CPU time will then be the deciding factor. The disk cache will give you close to the same performance you'd get by reading the files in first and doing it from RAM.
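
To illustrate the "previous and current rows" point, here is a minimal sketch, not taken from the answer itself. The class and method names (`RollingLcs`, `lcsLength`) are made up for the example, and it still takes in-memory byte arrays rather than files to stay short. It keeps only two rows of the length table, so memory grows with the length of `y` rather than with the product of both lengths. Note that it returns only the length of the LCS; recovering the subsequence itself still needs the direction information (kept in RAM or, as suggested above, in a file).

```java
/** Hypothetical helper for illustration: LCS length using two rows only. */
class RollingLcs {

    /** Returns the length of the longest common subsequence of x and y. */
    static int lcsLength(byte[] x, byte[] y) {
        int[] prev = new int[y.length + 1]; // row i - 1 (zero-initialised by new)
        int[] curr = new int[y.length + 1]; // row i

        for (int i = 1; i <= x.length; i++) {
            for (int j = 1; j <= y.length; j++) {
                if (x[i - 1] == y[j - 1]) {
                    // Match: extend the LCS of the two shorter prefixes.
                    curr[j] = prev[j - 1] + 1;
                } else {
                    // No match: best of dropping a byte of x or a byte of y.
                    curr[j] = Math.max(prev[j], curr[j - 1]);
                }
            }
            // The current row becomes the previous row for the next i.
            // Index 0 is never written, so it stays 0 in both rows.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length];
    }
}
```

Each cell of the current row is overwritten before it is read, so the two arrays can be swapped without clearing them.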

chiastic-security
  • Thanks for the reply. As for processing the LCS while reading, how do I do that? Using linked list? – Anonymous Oct 15 '14 at 02:18
  • No, just use the file itself as your data structure. For the inputs, open the file and then read in byte by byte. For the output, open the file and write it byte by byte, seeking to the right point in the file when you need to read something back in. – chiastic-security Oct 15 '14 at 04:20
  • @Anonymous No, not at all. Use a `BufferedInputStream`, and whenever you want the next byte, read it from the stream. No need for threads at all. It's just a case of avoiding reading the whole file in first. – chiastic-security Oct 15 '14 at 06:01
  • But doesn't the LCS algorithm require all strings (in my case I use bytes) to be placed in matrices first before calculating the LCS, so that we need to read all the bytes first? Or am I wrong? Sorry for all the questions, and thanks in advance. – Anonymous Oct 15 '14 at 06:58
  • @Anonymous No, it doesn't need them to be in a matrix, it just needs you to be able to get at them when you need them. If you look at the algorithm, it reads each input sequentially. Whether you take them from an array or from something else is irrelevant. – chiastic-security Oct 15 '14 at 07:04
0

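A straight conversion to byte arrays, without revisiting the algorithm itself.

In Java, `new` initializes array elements to 0 / 0.0 / false / null, so a freshly allocated table needs no explicit clearing.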
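
As a small illustration of that default initialisation, and of one possible way to reset a table for reuse (the question's second point), here is a sketch. The `TableReuse` class and its `clear` methods are hypothetical names, not part of the original code. For context: `Arrays.fill(D_new, null)` on an `int[][]` compiles but replaces every row reference with `null`, so a later `D_new[i][j]` throws a `NullPointerException`; filling each row with 0 keeps the rows intact.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration: reset DP tables so they can be reused. */
class TableReuse {

    /** Sets every cell of a 2-D int table back to 0, keeping the rows allocated. */
    static void clear(int[][] table) {
        for (int[] row : table) {
            Arrays.fill(row, 0);
        }
    }

    /** Same idea for a 2-D byte table, such as the direction table L. */
    static void clear(byte[][] table) {
        for (byte[] row : table) {
            Arrays.fill(row, (byte) 0);
        }
    }
}
```

Simply allocating a new table with `new int[n + 1][m + 1]` has the same effect, at the cost of a fresh allocation each time.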

On the other hand, `ByteArrayOutputStream` has no built-in way to prepend, so the backtrack collects the LCS back to front. Reversing the resulting array afterwards is simple.

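import java.io.ByteArrayOutputStream;

public static byte[] compare(byte[] x, byte[] y) {
    int i, j;
    final int n = x.length;
    final int m = y.length;
    /* D[i][j] = length of the LCS of the first i bytes of x and the first j bytes of y,
       L[i][j] = direction to backtrack (1 = diagonal, 2 = up, 3 = backward) */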
    int[][] D = new int[n + 1][m + 1];
    byte[][] L = new byte[n + 1][m + 1]; // { 1, 2, 3 }

    /* D[i][0] = 0 for 0<=i<=n */
    /* D[0][j] = 0 for  0<=j<=m */
    for (i = 1; i <= n; i++) {
        for (j = 1; j <= m; j++) {
            if (x[i - 1] == y[j - 1]) {
                D[i][j] = D[i - 1][j - 1] + 1;
                L[i][j] = 1;
            } else if (D[i - 1][j] >= D[i][j - 1]) {
                D[i][j] = D[i - 1][j];
                L[i][j] = 2;
            } else {
                D[i][j] = D[i][j - 1];
                L[i][j] = 3;
            }
        }
    }

    /* Backtrack */
    ByteArrayOutputStream lcs = new ByteArrayOutputStream();
    i = n;
    j = m;
    while (i != 0 && j != 0) {
        switch (L[i][j]) {
            case 1:   /* diagonal */
                lcs.write(x[i - 1]); // We want lcs reversed though.
                --i;
                --j;
                break;
            case 2:  /* up */
                --i;
                break;
            case 3:  /* backward */
                --j;
                break;
        }
    }
    byte[] result = lcs.toByteArray();
    // Reverse:
    for (i = 0, j = result.length - 1; i < j; ++i, --j) {
        byte b = result[i];
        result[i] = result[j];
        result[j] = b;
    }
    return result;
}
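
On the question's first point (appending `result` and `result_new`), one possible approach is to write each chunk's result into a single `ByteArrayOutputStream`. The sketch below uses a hypothetical `ChunkJoiner.join` helper that is not part of the code above. Keep in mind that concatenating the LCS of successive chunk pairs gives a common subsequence of the two files, but not necessarily the longest one.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper for illustration: concatenate per-chunk results. */
class ChunkJoiner {

    /** Returns a new array holding the bytes of all parts, in order. */
    static byte[] join(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            // This three-argument overload does not declare IOException.
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }
}
```

A call such as `ChunkJoiner.join(result, result_new)` would then replace the two separate `return` statements with a single one at the end of the method.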
Joop Eggen