I have two ArrayLists. ArrayList A has 8.1k elements and ArrayList B has 81k elements.

I need to iterate through B, search A for an item matching the current element, and then change a field in the matched element of B.

Here's my code:

private void mapAtoB(List<A> aList, ListIterator<B> it) {
    AtomicInteger i = new AtomicInteger(-1);
    while(it.hasNext()) {
        System.out.print(i.incrementAndGet() + ", ");
        B b = it.next();
        aList.stream().filter(a -> b.equalsB(a)).forEach(a -> {
            b.setId(String.valueOf(a.getRedirectId()));
            it.set(b);
        });
    }
    System.out.println();
}

public class B {
    public boolean equalsB(A a) {
        if (a == null) return false;

        if (this.getFullURL().contains(a.getFirstName())) return true;

        return false;
    }
}

But this is taking forever: the method needs close to 15 minutes to finish. Is there any way to optimize this? A 15-minute run time is way too much.

Richard
  • Use the index, Luke! – Boris the Spider Mar 16 '17 at 19:27
  • I would start by removing the System.out.print and println calls. That's most probably what takes most of the time. You should also tell what b.equalsB(a) does (i.e. post the code): you could maybe use a HashMap, and reduce the complexity to O(m) instead of O(m*n). And remove the it.set(b), which replaces the b by itself, and is thus unnecessary. Also, since every matched a replaces the ID in B that has been set by the previous matched A, you could iterate backwards, and stop the loop as soon as you've found a match. – JB Nizet Mar 16 '17 at 19:43
  • @JBNizet I did post the code for b.equalsB(a). It's right below the first method. And the B needs to be set because we're changing the id and then putting it back into the list – Richard Mar 16 '17 at 19:45
  • What do the URLs in Bs look like? Do they all look the same? Can you parse them and extract the "firstName" part out of them? – JB Nizet Mar 16 '17 at 19:50
  • @JBNizet they're all different so having a standard parse would be difficult – Richard Mar 16 '17 at 19:53
  • Any limitations on fullURL of B and/or firstName of A? Minimal length? Maximal length? Distribution? (BTW, without "prints" your code with some randomly generated data and the array sizes you specified runs in 10s of seconds on my laptop.) – starikoff Mar 17 '17 at 17:16

1 Answer

I'd be happy to see a good and thorough solution; meanwhile, I can propose two ideas (or maybe two reincarnations of one).

The first one is to speed up the search for all objects of type A within a single object of type B. The Rabin-Karp algorithm seems applicable and simple enough to implement quickly; Aho-Corasick is harder to implement but will probably give better results, though I'm not sure how much better.
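To make the Aho-Corasick idea a bit more concrete, here is a minimal sketch of such an automaton in plain Java. It is not the code I measured, and the class name AhoCorasickSketch and its methods are made up for illustration. It assumes the patterns are the non-empty firstName values of the A's (empty patterns are simply skipped) and returns the indices of all patterns that occur in a given text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not from the question: a minimal Aho-Corasick automaton.
final class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                   // longest proper suffix that is also a trie prefix
        final List<Integer> out = new ArrayList<>(); // indices of patterns ending at this node
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> patterns) {
        // 1. Build a trie of all non-empty patterns.
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).isEmpty()) continue; // assumption: empty patterns are ignored
            Node node = root;
            for (char c : patterns.get(i).toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.out.add(i);
        }
        // 2. Breadth-first pass: set failure links and merge outputs along them.
        Deque<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.out.addAll(child.fail.out);
                queue.add(child);
            }
        }
    }

    /** Returns the indices of all patterns that occur somewhere in text. */
    Set<Integer> matches(String text) {
        Set<Integer> found = new TreeSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.out);
        }
        return found;
    }
}
```

With something like this you would build the automaton once from all firstName values and then call matches(b.getFullURL()) once per B, so each URL is scanned a single time instead of once per A. Whether that beats the N-gram index for your data is something you would have to measure.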

The other option is to limit the number of objects of type B that have to be fully checked for each object of type A. For that you could, e.g., build an inverted N-gram index: for each fullURL you take all of its substrings of length N ("N-grams") and build a map from each N-gram to the set of B's whose fullURL contains it. When searching for an object A, you take all the N-grams of its firstName, look up the set of B's for each one, and intersect these sets; the intersection contains all the B's that you still have to check fully.

I implemented this approach quickly; for the sizes you specified it gives a 6-7x speedup for N=4. As N grows, the search becomes faster but building the index slows down (so if you can reuse the index, you are probably better off choosing a bigger N). The index takes about 200 MB for the sizes you specified, so this approach will only scale so far as the collection of B's grows.

Assuming that all strings are at least NGRAM_LENGTH characters long, here's the quick and dirty code for building the index using Guava's SetMultimap and HashMultimap:

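    import com.google.common.collect.HashMultimap;
    import com.google.common.collect.SetMultimap;

    // Map every N-gram to the set of B's whose fullURL contains it.
    SetMultimap<String, B> idx = HashMultimap.create();
    for (B b : bList) {
        String url = b.getFullURL();
        for (int i = 0; i + NGRAM_LENGTH <= url.length(); i++) {
            idx.put(url.substring(i, i + NGRAM_LENGTH), b);
        }
    }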

And for the search:

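private void mapAtoB(List<A> aList, SetMultimap<String, B> mmap) {
    for (A a : aList) {
        String name = a.getFirstName();
        Collection<B> possible = null;
        // Intersect the candidate sets of the name's N-grams.
        for (int i = 0; i + NGRAM_LENGTH <= name.length(); i++) {
            Set<B> forNgram = mmap.get(name.substring(i, i + NGRAM_LENGTH));
            if (possible == null) {
                possible = new ArrayList<>(forNgram);
            } else {
                possible.retainAll(forNgram);
            }
            if (possible.size() < 20) { // few enough candidates to scan directly
                break;
            }
        }
        if (possible == null) {
            continue; // name shorter than NGRAM_LENGTH: not covered by the index (assumed not to happen)
        }
        // Verify each remaining candidate with the real substring test.
        for (B b : possible) {
            if (b.equalsB(a)) {
                b.setId(String.valueOf(a.getRedirectId())); // setId takes a String, as in the question
            }
        }
    }
}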
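If you don't want to pull in Guava, or can't guarantee the minimum length, the same idea can be sketched with plain collections. The following is only an illustration under my own assumptions: NgramIndexSketch and findContaining are made-up names, the index works on plain strings and list positions instead of your A and B classes, and names shorter than N fall back to a full scan.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the Guava-based index: N-gram -> positions of URLs containing it.
final class NgramIndexSketch {
    private final int n;
    private final List<String> urls;
    private final Map<String, Set<Integer>> index = new HashMap<>();

    NgramIndexSketch(List<String> urls, int n) {
        this.urls = urls;
        this.n = n;
        for (int u = 0; u < urls.size(); u++) {
            String url = urls.get(u);
            for (int i = 0; i + n <= url.length(); i++) {
                index.computeIfAbsent(url.substring(i, i + n), k -> new HashSet<>()).add(u);
            }
        }
    }

    /** Positions (ascending) of all URLs that contain name as a substring. */
    List<Integer> findContaining(String name) {
        List<Integer> result = new ArrayList<>();
        if (name.length() < n) {
            // Too short to have an N-gram: fall back to a full scan.
            for (int u = 0; u < urls.size(); u++) {
                if (urls.get(u).contains(name)) result.add(u);
            }
            return result;
        }
        Set<Integer> candidates = null;
        for (int i = 0; i + n <= name.length(); i++) {
            Set<Integer> forNgram =
                    index.getOrDefault(name.substring(i, i + n), Collections.emptySet());
            if (candidates == null) {
                candidates = new HashSet<>(forNgram);
            } else {
                candidates.retainAll(forNgram);
            }
            if (candidates.isEmpty()) break; // no URL can match any more
        }
        // The intersection only narrows things down; confirm with the real test.
        for (int u : candidates) {
            if (urls.get(u).contains(name)) result.add(u);
        }
        Collections.sort(result);
        return result;
    }
}
```

In your case the URLs would come from b.getFullURL() and the names from a.getFirstName(); the returned positions tell you which B's to update for that A.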

A possible further optimization would be to use hashes of the N-grams as keys instead of the N-grams themselves, reducing the memory footprint and avoiding N-gram key comparisons.
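As a rough sketch of what hashed keys could look like (again not something I benchmarked, and NgramHashSketch, ngramHashes and buildIndex are made-up names): a rolling polynomial hash gives the hash of every N-gram in one pass without creating substrings. With base 31 and int overflow it produces the same value as String.hashCode() of the corresponding substring. Different N-grams can collide, so the candidates still have to be confirmed with the real contains check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: N-gram index keyed by hash values instead of substrings.
final class NgramHashSketch {
    private static final int BASE = 31; // same multiplier String.hashCode() uses

    /** Rolling hashes (int arithmetic, overflow wraps) of every N-gram of s, left to right. */
    static int[] ngramHashes(String s, int n) {
        if (s.length() < n) return new int[0];
        int[] hashes = new int[s.length() - n + 1];
        int highPower = 1; // BASE^(n-1), used to remove the leading character
        for (int i = 1; i < n; i++) highPower *= BASE;
        int h = 0;
        for (int i = 0; i < n; i++) h = h * BASE + s.charAt(i);
        hashes[0] = h;
        for (int i = n; i < s.length(); i++) {
            // Drop s[i-n], shift the window, append s[i].
            h = (h - s.charAt(i - n) * highPower) * BASE + s.charAt(i);
            hashes[i - n + 1] = h;
        }
        return hashes;
    }

    /** Maps each N-gram hash to the positions of the URLs that produced it. */
    static Map<Integer, List<Integer>> buildIndex(List<String> urls, int n) {
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int u = 0; u < urls.size(); u++) {
            for (int h : ngramHashes(urls.get(u), n)) {
                List<Integer> ids = index.computeIfAbsent(h, k -> new ArrayList<>());
                // u only grows, so checking the last entry is enough to avoid duplicates
                if (ids.isEmpty() || ids.get(ids.size() - 1) != u) ids.add(u);
            }
        }
        return index;
    }
}
```

Lookup then works as before, only with ngramHashes(a.getFirstName(), N) as the keys. A primitive-keyed map from a collections library would save more memory than boxed Integer keys, but that is beyond this sketch.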

starikoff