0

Using Java, I have a class which retrieves a webpage as a byte array. I then need to strip out some content if it exists. (The application monitors web pages for changes, but needs to remove session Ids from the html which are created by php, and would mean changes were detected each visit to the page).

Some of the resulting byte arrays could be 10s of 1000s bytes long. They're not stored like this - a 16 byte MD5 of the page is stored. However, it is the original full size byte array which needs to be processed.

(UPDATE - the code does not work. See comment from A.H. below) A test showing my code:

public void testSessionIDGetsRemovedFromData() throws IOException
    {

        byte[] forumContent = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&amp;sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a></li>".getBytes();

        byte[] sidPattern = "&amp;sid=".getBytes();
        int sidIndex = ArrayCleaner.getPatternIndex(forumContent, sidPattern);
        assertEquals(54, sidIndex);

        // start of cleaning code
        ArrayList<Byte> forumContentList = new ArrayList<Byte>();
        forumContentList.addAll(forumContent);
        forumContentList.removeAll(Arrays.asList(sidPattern));

        byte[] forumContentCleaned = new byte[forumContentList.size()];
        for (int i = 0; i < forumContentCleaned.length; i++)
        {
            forumContentCleaned[i] = (byte)forumContentList.get(i);
        }
        //end of cleaning code

        sidIndex = ArrayCleaner.getPatternIndex(forumContentCleaned, sidPattern);
        assertEquals(-1, sidIndex);
    }

This all works fine, but I'm worried about the efficiency of the cleaning section. I had hoped to operate solely on arrays, but the ArrayList has nice built in functions to removed a collection from the ArrayList, etc, which is just what I need. So I have had to create an ArrayList of Byte, as I can't have an ArrayList of the primitive byte (can anyone tell me why?), convert the pattern to remove to another ArrayList (I suppose this could be an ArrayList all along) to pass to removeAll(). I then need to create another byte[] and cast each element of the ArrayList of Bytes to a byte and add it to the byte[].

Is there a more efficient way of doing all this? Can it be performed using arrays?

UPDATE This is the same functionality using strings:

    public void testSessionIDGetsRemovedFromDataUsingStrings() throws IOException
{       
    String forumContent = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&amp;sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a></li>";
    String sidPattern = "&amp;sid=";

    int sidIndex = forumContent.indexOf(sidPattern);
    assertEquals(54, sidIndex);

    forumContent = forumContent.replaceAll(sidPattern, "");
    sidIndex = forumContent.indexOf(sidPattern);
    assertEquals(-1, sidIndex);
}

Is this as efficient as the array/arrayList method?

Thanks, Barry

barry
  • 4,037
  • 6
  • 41
  • 68
  • 3
    I'm just curious if a web page parser such as Jsoup would work better. – Hovercraft Full Of Eels Oct 30 '11 at 15:03
  • 2
    Why are you using byte arrays and not Strings? – Matt Ball Oct 30 '11 at 15:17
  • Err... to be honest, I can't think of a good answer to that. I suppose it felt natural to use a byte[] to store this data in a database. Would using strings not use more memory? – barry Oct 30 '11 at 15:36
  • You also have yet to address my question: why not simply strip the data that you ***do*** want to check with Jsoup or a similar web parser? – Hovercraft Full Of Eels Oct 30 '11 at 15:37
  • Well, I'll have to go and check it out before I can answer that. If there's an existing component that would do the job then that's great, so thanks for the pointer. – barry Oct 30 '11 at 15:42
  • Yep - it works really well with strings. I'll add the string code for you to see. Will it be any less efficient than the array/arraylist method? – barry Oct 30 '11 at 15:54

3 Answers3

5

You can use List#toArray() to convert any list to an array.

Things are a bit more complicated in this specific use case because there is no elegant way to auto-unbox (from Byte to byte) when converting the list. Good ol' Java generics. Which is a nice segue into...

So I have had to create an ArrayList of Byte, as I can't have an ArrayList of the primitive byte (can anyone tell me why?)

Because, in Java, generic type parameters cannot be primitives. See Why can Java Collections not directly store Primitives types?


Side note: as a matter of style, you should almost always declare ArrayList types as List:

List<Byte> forumContentList = new ArrayList<Byte>();

See Java - declaring from Interface type instead of Class and Type List vs type ArrayList in Java.

Community
  • 1
  • 1
Matt Ball
  • 354,903
  • 100
  • 647
  • 710
3

This all works fine, I'm worried about the efficiency of the cleaning section...

Really? Did you inspect the resulting "string"? On my machine the data in forumContentCleaned still contains the &amp;sid=... data.

That's because

forumContentList.removeAll(Arrays.asList(sidPattern));

tries to remove a List<byte[]> from a List<Byte>. This will do nothing. And even if you replace the argument of removeAll with a real List<Byte> containing the bytes of "&amp;sid=", then you will remove ALL occurences of each a, each m, each p and so forth. The resulting data will look like this:

<l cl"con-logout">< href"./uc.h?oelogout34043284674572e35881e022c68fc8" ttle....

Well, strictly speaking, the &amp;sid= part is gone, but I'm quite sure this is not what you wanted.

Therefore take a step back and think: You are doing string manipulation here, so use a StringBuilder, feed it with the String(forumContent) and do your manipulation there.

Edit

Looking at the given example input string, I guess, that also the value of sid should be removed, not only the key. This code should do it efficiently without regular expresions:

String removeSecrets(String input){
    StringBuilder sb = new StringBuilder(input);

    String sidStart = "&amp;sid=";
    String sidEnd = "\"";

    int posStart = 0;
    while ((posStart = sb.indexOf(sidStart, posStart)) >= 0) {
        int posEnd = sb.indexOf(sidEnd, posStart);
        if (posEnd < 0)     // delete as far as possible - YMMV
            posEnd = sb.length();
        sb.delete(posStart, posEnd);
    }

    return sb.toString();
}

Edit 2

Here is a small benchmark between StringBuilder and String.replaceAll:

public class ReplaceAllBenchmark {
    public static void main(String[] args) throws Throwable {
        final int N = 1000000;
        String input = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&amp;sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a>&amp;sid=3a4043284674572e35881e022c68fcd8\"</li>";

        stringBuilderBench(input, N);
        regularExpressionBench(input, N);
    }

    static void stringBuilderBench(String input, final int N) throws Throwable{
        for(int run=0; run<5; ++run){
            long t1 = System.nanoTime();
            for(int i=0; i<N; ++i)
                removeSecrets(input);
            long t2 = System.nanoTime();
            System.out.println("sb: "+(t2-t1)+"ns, "+(t2-t1)/N+"ns/call");
            Thread.sleep(1000);
        }
    }

    static void regularExpressionBench(String input, final int N) throws Throwable{
        for(int run=0; run<5; ++run){
            long t1 = System.nanoTime();
            for(int i=0; i<N; ++i)
                removeSecrets2(input);
            long t2 = System.nanoTime();
            System.out.println("regexp: "+(t2-t1)+"ns, "+(t2-t1)/N+"ns/call");
            Thread.sleep(1000);
        }
    }

    static String removeSecrets2(String input){
        return input.replaceAll("&amp;sid=[^\"]*\"", "\"");
    }
}

Results:

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.9) (6b20-1.9.9-0ubuntu1~10.04.2)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

sb: 538735438ns, 538ns/call
sb: 457107726ns, 457ns/call
sb: 443282145ns, 443ns/call
sb: 453978805ns, 453ns/call
sb: 458895308ns, 458ns/call
regexp: 2404818405ns, 2404ns/call
regexp: 2196834572ns, 2196ns/call
regexp: 2239056178ns, 2239ns/call
regexp: 2164337638ns, 2164ns/call
regexp: 2177091893ns, 2177ns/call
A.H.
  • 63,967
  • 15
  • 92
  • 126
  • I'm just working on it now... Apologies for posting dodgy code :-( +1 – barry Oct 30 '11 at 16:01
  • As an aside, the test did pass when I posted the question, but afterwards I had realised I'd made a mistake at the forumContentList.addAll(forumContent); line. I'd actually pasted forumContentList.addAll(forumContentList); which is nonsense but I got the JUnit green line :-/ No excuse, I know... – barry Oct 30 '11 at 16:05
  • I've looked at StringBuilder, but I thing String looks more suited to my needs, as the replaceAll() method does exactly what I need. – barry Oct 30 '11 at 16:07
  • If you use `String.replaceAll` then please note, that his uses regular expressions internally and compiles the pattern for each call. Your question was about doing thing efficiently. Say goodbye to that with `replaceAll`. – A.H. Oct 30 '11 at 16:28
  • Ok, can you suggest how to do it efficiently? – barry Oct 30 '11 at 16:35
  • What about searching for indexes myself and using substring to strip out the bits I don't want? – barry Oct 30 '11 at 16:54
  • Strange - I did some benchmarking on some massive strings and the StringBuilder is always slower than the String method. The sid pattern was not in the string, but I'm not sure that would make a difference. – barry Oct 30 '11 at 17:48
  • I tried it with the pattern in the test string and the string builder is ever so slightly faster. It's close though - for a string 81828 characters long, the string builder averages 32 ms quicker after 500 runs. Without the pattern being included, the string method is quicker by 11ms. Is this surprising to you, i.e. did you expect more of a gap between their performance? I suppose if this code was being run many many times 32ms would be a lot... – barry Oct 30 '11 at 18:10
  • Some generic comments: First: Doing _valid_ microbenchmarks is darkest magic in Java. Look around this site for examples. Second: Talking about numbers without actual code is pure guessing. Third: Comparing apples and oranges (i.e. with pattern and without - whatever this means) is random guessing. Fourth: Did you compare to a solution which also removes the _value_ after _sid_? You see - in order to talk about that _properly_ a series of small comment-boxes is not the right place. – A.H. Oct 30 '11 at 18:34
  • I was using two methods which didn't replace the value after sid. Points taken though. Thanks for all the input. – barry Oct 30 '11 at 18:51
  • @barry: Added some benchmarks. No black magic yet but a good start. – A.H. Oct 30 '11 at 19:02
1

I dont think two codes have the same function.

the first code removes all characters in the sidPattern from forumContent. the second code removes the sidPattern string from forumContnt, maybe not functional, cause replaceAll() accept the argument as regular expression pattern.

are you sure you want to remove "&sid=" rather than "&sid=3a4043284674572e35881e022c68fcd8" ?

anyway, I think String is fine, List is a little bit heavy.

QIAO Peng
  • 11
  • 1
  • Yes, eventually I want to remove the whole sid id, up to the closing ". But I was just wondering about efficiency at this point. Thanks for the answer. – barry Oct 30 '11 at 16:35
  • F**k, all I edited lost. First, List will cost m*n for removeAll() method loop, m*n means contentListSize * sidListSize. you may not avoid regular expression, cause you have to reomve the whole sid including the value of that parameter, unless you implement is yourself. – QIAO Peng Oct 30 '11 at 16:57