2

I am comparing substrings in two large text files. Very simple, tokenizing into two token containers, comparing with 2 for loops. Performance is disastrous! Does anybody have an advice or idea how to improve performance?

for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    strLengthA = txtA.getNumToken(s);

    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1;

        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            strLengthB = txtB.getNumToken(t);

            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    subStrTemp = new SubStrTemp(
                        txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                        (tokenFileA + strLengthA - 1), 
                        (tokenFileB + strLengthB - 1));

                    if (subStrContainer.contains(subStrTemp) == false) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error");
                }
            }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}

Generally my code reading two large Strings with Java Tokonizer into containers A and B. And then trying to compare substrings.Possision of Substrgs which are existing in both strings to store into a Vector. But performance is awful, also don't really know how to solve it with HashMap.

4 Answers4

7

Your main problem is that you go through all txtB for each token in txtA.

You should store informations on token from txtA (in a HashMap for instance) and then in a second loop (but not a nested one) you compare the strings with the existing one in the Map.


On the same topic :

Community
  • 1
  • 1
Colin Hebert
  • 91,525
  • 15
  • 160
  • 151
  • Thank you Colin HEBERT, "nested" -> "for(){ for(){} }, ""not nested" -> "for(){}, for(){}" right? Hashmap I am really afraid of.. never code coded it before. Since i know in HashMap I have to use HashSet and here redundant tokens become removed!? Ok, I don't need them, but I need their positions. Can you tell me pls, if I can store and retrieve token positions with HashMap? – jackdaniels Sep 06 '10 at 10:17
  • It's exacly this for the nested/not nested. If you want to keep the positions, you can do this `HashMap>` so you can have for each word a list of its position. Or better, instead of Integer your own structure with filename, position and other informations. – Colin Hebert Sep 06 '10 at 11:43
  • Huh... think I implemented your suggestion Colin. But somehow unable to get Hashmap paramters.. Can you have a look pls, programming code is here – jackdaniels Sep 06 '10 at 16:31
  • Wow! Your code is just beatifull!! How many years programming expirience are behind this style!? It seems to be a HaschJoin, that "meriton" suggested me before. Right? – jackdaniels Sep 06 '10 at 20:59
  • It is kind of like the "HashJoin" of @meriton. But I kept your code, so it doesn't remove punctuation and doesn't compare with lower case words. – Colin Hebert Sep 06 '10 at 21:05
  • But now, HashMap is not sorted. And positions are totally disordered... How can I now retrieve tokens in right order? To get start/end of text, if have to search every token in entire map? And the I have to summerize suceed tokens? How can I manage this without nested loops? – jackdaniels Sep 06 '10 at 21:11
  • Thank you!! Punctuation and other stuff I can manage with StringTokoniser utils ;-). It c ompares with original words, and this is what I need. Or is there any other advantage to do string uppercase? ..performance or match-number improvement? – jackdaniels Sep 06 '10 at 21:24
  • You can replace the Sets by Lists if you want to preserve the order. – Colin Hebert Sep 06 '10 at 21:41
  • Thanks a lot Colin!! It was a great example. Still I have to use 2x stringTokenizer-while-loops ( O(n^2)-?), but perfomance increases rapidly. – jackdaniels Sep 14 '10 at 17:46
  • I don't think that a faster way exists, but I would be glad to see it if it does. – Colin Hebert Sep 14 '10 at 20:28
2

You are doing a join with nested loops? Yes, that is O(n^2). What about doing a hash join instead? That is, create a map from (lowercased) strText to t and do lookups with this map rather than iterating over the token container?

meriton
  • 68,356
  • 14
  • 108
  • 175
  • Hello Meriton, Thank You for helping as well. Yes I did it, but don't wanna more. Performance was also ok with small strings, another reason with nested loops I was able to store strPositions(of same substrings) (almost) sorted in a Vector. – jackdaniels Sep 06 '10 at 10:46
  • Yes I got it already there is no way around Hashing and Mapping.. I have to lear it..:-( Can you tell me pls, how can I do a HashJoin in Java? Didn't find any java related example -especially HashJoin in google... And if do HashJoin how can I store subStr-positons, these are necessary to store. – jackdaniels Sep 06 '10 at 11:00
  • Please tell me also, why it is required to lowercase String? How can I create a "map from (lowercased) strText to t"? This sentence I didn’t really understood.. Thanks in advance. Please tell me also, why it is required to lowercase String? How can I create a "map from (lowercased) strText to t"? This sentence I didn’t really undertood.. – jackdaniels Sep 06 '10 at 11:03
  • lowercase is so that your program will consider "Token" and "token" to be the same word. – James Sep 06 '10 at 11:28
  • Yes, I know what lowercase does. But what kind benefits I will get if using in (Hash)map related context? ( ... Interger!=integer... ) – jackdaniels Sep 06 '10 at 16:35
0

Put the tokens of fileA into a trie data structure. Then when tokenising fileB you can check quite quickly if these tokens are in the trie. A few code comments would help.

James
  • 9,064
  • 3
  • 31
  • 49
  • Thanks James, which data structure would you suggest to use? – jackdaniels Sep 06 '10 at 10:21
  • More comments, with pleasure.. I am reading txt files with help of java tokenizer into a string and then trying to search substring of DocA in DocB. Doing this in 2 cases. 1st case substr-length is constat, 2nd case substr-length vary, for this reason i added "if (strLengthA >= dp.getMinStrLength())" to reduce iteretion for very short substrings. – jackdaniels Sep 06 '10 at 10:32
0

A said, this is an issue of complexity and you're algorithm runs in O(n^2) instead of O(n) using hash.

For second order improvements try to call less to functions, for example you can get the size once

sizeB = txtB.TokenContainer.size();

Depeneds on the size, you may call the container once to get an array of strings to save the getStr....

Roni

roni bar yanai
  • 1,492
  • 10
  • 12
  • Thanks Roni, I was not sure if calling functions will take some performance. But of course, especially "txtB.TokenContainer.size();" programm calls every n-times, absolutly unnecessary. – jackdaniels Sep 06 '10 at 09:59