
I have a large list of strings (List<String>) that may contain over 10,000 unique elements, and I need to check this list many times in a loop (possibly more than 10,000 times as well) to find out whether it contains certain elements.

For example:

/**
 * The size of this list might be over 10,000.
 */
public static final List<String> list = new ArrayList<>();

<...>
/**
 * The size of the 'x' list might be over 10,000, too.
 *
 * This method just does something with elements in the list 'x'
 * which are not in the list 'list' (for example (!), just returns them).
 */
public static List<String> findWhatsNotInList(List<String> x) {
    List<String> result = new ArrayList<>();

    for (String s : x) {
        if (list.contains(s))
            continue;
        result.add(s);
    }

    return result;
}
<...>

Depending on the sizes of the lists list and x, this method can run for several minutes, which is too long.

Is there a way to speed this process up? (Feel free to suggest anything, up to and including replacing the List and the loop with something else.)

EDIT: Besides List#contains, I might need to use List#stream and perform checks other than just String#equals (e.g. with startsWith). For example:

/**
 * The size of this list might be over 10,000.
 */
public static final List<String> list = new ArrayList<>();

<...>
/**
 * The size of the 'x' list might be over 10,000, too.
 *
 * This method just does something with strings in the list 'x'
 * which do not start with any of the strings in the list 'list' (for example (!), just returns them).
 */
public static List<String> findWhatsNotInList(List<String> x) {
    List<String> result = new ArrayList<>();

    for (String s : x) {
        if (startsWithAny(s, list))
            continue;
        result.add(s);
    }

    return result;
}
<...>
/**
 * Checks whether the given string `s` starts with any string from the list `sw`.
 */
public static boolean startsWithAny(String s, List<String> sw) {
    // static, because it is called from the static findWhatsNotInList
    return sw.stream().anyMatch(s::startsWith);
}
<...>

EDIT #2: An example:

import java.util.ArrayList;
import java.util.List;

public class Test {

    private static final List<String> list = new ArrayList<>();

    static {
        for (int i = 0; i < 7; i++) {
            list.add(Integer.toString(i));
        }
    }

    public static void main(String[] args) {
        List<String> in = new ArrayList<>();

        for (int i = 0; i < 10; i++)
            in.add(Integer.toString(i));
        List<String> out = findWhatsNotInList(in);

        // Prints 7, 8 and 9: the Strings that do not start with
        // 0, 1, 2, 3, 4, 5, or 6 (the Strings from the list `list`)
        out.forEach(System.out::println);
    }

    private static List<String> findWhatsNotInList(List<String> x) {
        List<String> result = new ArrayList<>();

        for (String s : x) {
            if (startsWithAny(s, list))
                continue;
            result.add(s);
        }

        return result;
    }

    private static boolean startsWithAny(String s, List<String> sw) {
        return sw.stream().anyMatch(s::startsWith);
    }

}
German Vekhorev

1 Answer


You are basically asking how to best reinvent the wheel.

The only reasonable answer is: don't.

Meaning: you want to implement large-scale searching on "big data". I suggest that you instead look into frameworks such as Solr or ElasticSearch, because the only real answer to working with large amounts of data is to utilize "scale out" solutions. Doing that "yourself" is a serious undertaking!

If there is the slightest chance that your requirements will "grow" and more sophisticated searching will be required, then spend your energy on picking the best-matching technology instead of trying to build something that is hard to build.

The aforementioned frameworks come with a certain overhead, but if used correctly they can cope with terabytes of data. Nothing that you as a single developer can put together will ever get close to that. And along the way you will most likely repeat the same errors that everyone makes. Or, as said, you pick up tools whose authors saw such errors and fixed them years ago.
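
To judge whether that overhead is worth it for the sizes mentioned in the question, it may help to have a plain in-memory baseline to measure against. The following is only a rough sketch, not a replacement for the frameworks above. It assumes the data fits in memory and that exact or prefix matching is all that is needed. The class name InMemoryBaseline and the methods notContained and notStartingWithAny are hypothetical names made up for this illustration. The exact-match check relies on HashSet lookups instead of List#contains, and the prefix check looks up every prefix of the input string in the same set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical baseline for comparison; all names here are illustrative.
class InMemoryBaseline {

    // Exact match: returns the strings of x that are not in 'known'.
    // A HashSet lookup takes expected constant time, unlike List#contains.
    static List<String> notContained(List<String> x, Set<String> known) {
        List<String> result = new ArrayList<>();
        for (String s : x) {
            if (!known.contains(s)) {
                result.add(s);
            }
        }
        return result;
    }

    // Prefix match: true if s starts with any element of 'prefixes'.
    // Instead of scanning all prefixes, look up every prefix of s in the set.
    static boolean startsWithAny(String s, Set<String> prefixes) {
        for (int end = 0; end <= s.length(); end++) {
            if (prefixes.contains(s.substring(0, end))) {
                return true;
            }
        }
        return false;
    }

    // Returns the strings of x that do not start with any element of 'prefixes'.
    static List<String> notStartingWithAny(List<String> x, Set<String> prefixes) {
        List<String> result = new ArrayList<>();
        for (String s : x) {
            if (!startsWithAny(s, prefixes)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>();
        for (int i = 0; i < 7; i++) {
            known.add(Integer.toString(i));
        }

        List<String> in = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            in.add(Integer.toString(i));
        }

        System.out.println(notContained(in, known));       // [7, 8, 9, 10, 11]
        System.out.println(notStartingWithAny(in, known)); // [7, 8, 9]
    }
}
```

If a baseline like this turns out to be too slow or too limited for the real requirements, that is a concrete argument for moving to one of the frameworks mentioned above.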

GhostCat