String .contains VS Set .contains VS Regex String.matches()

Question

I have two sets of strings, neither very long (200~500 words), stored in two files that look like this:

File1          File2

this           window
that           good
word           work
java           fine
book           home

All unique words.

First, I read the strings from each file (line by line) and store them in one of two forms:

Set<String> set1 and Set<String> set2, which may look like this: [this, that, word, java, book] and [window, good, work, fine, home]

Or

String str1 and String str2, which may look like this: str1: thisthatwordjava and str2: windowgoodworkfinehome, or comma-separated, e.g. str1: this,that,word,java.

Now there are three ways to check which Set or String contains the word home:

To use set1/2.contains("home")
To use str1/2.contains("home")
To use str1/2.matches("home")

All of the above will work fine, but which one is the best?
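
For reference, a minimal sketch of how the three calls behave on the File2 sample data. The class `ContainsDemo` and its helper methods are hypothetical names made up for illustration. One thing worth noting: `String.matches` tests the whole string against the regex, so `matches("home")` is only true when the string is exactly `home`; finding a word inside a longer string needs a pattern such as `.*home.*`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; the data is the File2 sample from the question.
final class ContainsDemo {
    static final Set<String> SET2 =
        new HashSet<>(Arrays.asList("window", "good", "work", "fine", "home"));
    static final String STR2 = "windowgoodworkfinehome";

    static boolean inSet(String word)       { return SET2.contains(word); }
    static boolean inString(String word)    { return STR2.contains(word); }
    static boolean wholeMatch(String regex) { return STR2.matches(regex); }

    public static void main(String[] args) {
        System.out.println(inSet("home"));          // true
        System.out.println(inString("home"));       // true
        System.out.println(wholeMatch("home"));     // false: the whole string must match
        System.out.println(wholeMatch(".*home.*")); // true
    }
}
```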

Note: The purpose of this question is because the frequency of checking for string is very high.

My hunch would be that Set would be better. Since I'm guessing that it compares equality using the hash of its objects. — Olian04, Aug 09 '16 at 10:34
@kennytm Not if you append commas to the beginning and the end of the input string before checking. — marstran, Aug 09 '16 at 10:34
@Olian04 - Inserting into a Set takes `O(n)` and search is `O(1)` for a good hash function. That's more than `String#contains()`. But as Kennytm mentions, `contains()` can have its own problems (also we need to split to insert into Set) — TheLostMind, Aug 09 '16 at 10:35
@kennytm Not if you first check that the input string does not contain any commas :D — marstran, Aug 09 '16 at 10:35
@kennytm Yes I know what you say. But I will not check in such case you mentioned, I just ask about the performance. — Bahramdun Adil, Aug 09 '16 at 10:36
If you need to do multiple queries, it will definitely be faster to split the string once and put the words into a `Set`. — marstran, Aug 09 '16 at 10:37
What's the reason of storing single `String` in a `Set` of `String`s? Please read [ask] and provide [mcve]. `It's unclear what you're asking`. — xenteros, Aug 09 '16 at 10:39
@BahramdunAdil Did you profile your program and find out that the Set comparison is a hotspot? — Kayaman, Aug 09 '16 at 10:40
@xenteros He wants to split the string into multiple words. It is delimited with commas. — marstran, Aug 09 '16 at 10:42
@marstran I know, however it should be contained in the question. — xenteros, Aug 09 '16 at 10:43
A bit harsh to change the question entirely (including input format) after people answered. ;) My code won't match your question anymore, but the approach still holds for me: use the Set. — haylem, Aug 09 '16 at 13:19
OK, thank you for the answer and comments. I have rewritten the question for people who did not understand what I meant; the difficulty was knowing how to ask this kind of question. Anyway, I tried my best. — Bahramdun Adil, Aug 09 '16 at 13:27

haylem · Answer 1 · 2016-08-09T11:24:32.903

Don't Make Performance Assumptions

What makes you think that String.contains will have "better performance"?

It won't, except for very simple cases, that is if:

your list of strings is short,
the strings to compare are short,
you want to do a one-time lookup.

For all other cases, the Set approach will scale and work better. Sure, you'll have some memory overhead for the Set as opposed to a single string, but hash-based lookups stay O(1) on average even if you want to store millions of strings and compare long strings.

The Right Data-Structure and Algorithm for the Right Job

Use the safer and more robust design, especially as here it's not a difficult solution to implement. And as you mention that you will check frequently, then a set approach is definitely better for you.

Also, String.contains is unsafe: it matches substrings, so a lookup can succeed for a word that is not in your list. As kennytm said in a comment, if we use your example and you have the "java" string in your list, looking up "ava" will match it, which you apparently don't want.
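
A minimal sketch of that pitfall, assuming the comma-separated form of the list. The class and method names are hypothetical. `delimitedContains` is the workaround suggested in the comments (reject queries that contain a comma, then wrap both the list and the word in commas); the `Set` needs no such care.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; CSV is the comma-separated sample list.
final class SubstringPitfall {
    static final String CSV = "this,that,word,java,book";
    static final Set<String> WORDS = new HashSet<>(Arrays.asList(CSV.split(",")));

    // Naive check: true for any substring, not only for whole words.
    static boolean naiveContains(String word) {
        return CSV.contains(word);
    }

    // Workaround from the comments: refuse queries containing the delimiter,
    // then search for ",word," inside ",list,".
    static boolean delimitedContains(String word) {
        if (word.contains(",")) {
            return false;
        }
        return ("," + CSV + ",").contains("," + word + ",");
    }

    // Set lookup: matches whole entries only.
    static boolean setContains(String word) {
        return WORDS.contains(word);
    }
}
```

Here `naiveContains("ava")` is true even though "ava" is not in the list, while `delimitedContains("ava")` and `setContains("ava")` are both false.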

Pick the Right Set

You may not want to use a plain HashSet with its default settings, though; you could tune it, or pick another implementation. For instance, you could consider a Guava ImmutableSet, if your set will be created only once but checked very often.
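
If you'd rather stay with the JDK, one possible reading of "tuning" is to size the `HashSet` up front and wrap it read-only. This is only a sketch under that assumption; `LookupSets` and `readOnlySetOf` are hypothetical names, not part of any library.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, JDK only.
final class LookupSets {
    // Builds a read-only set whose capacity is chosen so that
    // adding all words does not trigger a rehash.
    static Set<String> readOnlySetOf(Collection<String> words) {
        float loadFactor = 0.75f; // HashSet's default load factor
        int capacity = (int) (words.size() / loadFactor) + 1;
        Set<String> set = new HashSet<>(capacity, loadFactor);
        set.addAll(words);
        return Collections.unmodifiableSet(set);
    }
}
```

For a few hundred words the pre-sizing is unlikely to be measurable; it matters more as the list grows.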

Examples

Here's what I'd do, assuming you want an immutable set (as you say you read the list of strings from a file). This is off-hand and without verification so forgive the lack of ceremonies.

Using Java 8 + Guava

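import java.io.File;
import java.util.Set;

import com.google.common.base.Charsets;
import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableSet;
import com.google.common.io.Files;

// Read the whole file, split it on commas, and freeze the result.
// Note: read() throws IOException, which the caller must handle.
final Set<String> lookupTable = ImmutableSet.copyOf(
  Splitter.on(',')
    .trimResults()
    .omitEmptyStrings()
    .split(Files.asCharSource(new File("YOUR_FILE_PATH"), Charsets.UTF_8).read())
);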

Season to taste with correct path, correct charset, and with or without trimming if you want to allow spaces and an empty string.

Using Only Java 8

If you don't want to use Guava and only vanilla Java, then simply do something like this in Java 8 (again, apologies, untested):

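import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

final Set<String> lookupTable =
    Files.lines(Paths.get("YOUR_FILE_PATH"))
      .flatMap(line -> Arrays.stream(line.split(",+")))  // flatten each line's tokens
      .collect(Collectors.toSet());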

Using Java < 8

If you have Java < 8, then use the usual FileInputStream to read the file, then String.split() or StringTokenizer to extract the entries, and finally add them to a Set.
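
A rough sketch of that pre-Java 8 route, equally untested against your real file. `LegacyWordLoader` and `load` are hypothetical names, and UTF-8 plus comma-separated tokens are assumptions; with one word per line, each line simply yields a single token.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical loader that avoids Java 7+ syntax on purpose.
final class LegacyWordLoader {
    // Reads the file line by line and collects every comma-separated token.
    static Set<String> load(String path) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(path), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokens = new StringTokenizer(line, ",");
                while (tokens.hasMoreTokens()) {
                    String word = tokens.nextToken().trim();
                    if (word.length() > 0) {
                        words.add(word); // skip blank tokens
                    }
                }
            }
        } finally {
            reader.close(); // no try-with-resources before Java 7
        }
        return words;
    }
}
```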

If he only needs to do a single lookup, a `String.contains` would probably be faster. In that case, he would not need to split the string and add the words to a `Set`. — marstran, Aug 09 '16 at 10:44
And as I commented on kennytm's comment; if you check that the input does not contain any commas and then add a comma to the beginning and end of the string, it will be safe to use `String.contains`. — marstran, Aug 09 '16 at 10:45
It's not really a matter of the simplicity of the lookup, but of the size of the set of strings he wants to do the lookup in. Considering the low complexity of the implementation of the `Set` approach, I can't see why you'd ever go for the `String.contains` one. It's less safe, less robust, less scalable. It's only more memory efficient. And your suggestion to address to kennytm's comment only pushes more complexity in the lookup. I'd go for the right data-structure for the right job. — haylem, Aug 09 '16 at 10:46
I said if he needs a single lookup.. In that case, it is not worth to split the string and add all elements to a set. And how is that less safe, if I might ask (considering what I wrote in my second comment)? — marstran, Aug 09 '16 at 10:47
He said he doesn't need a single lookup in the question: `Note: The purpose of this question is because the frequency of checking for string is very high.` — haylem, Aug 09 '16 at 10:48
You are absolutely right if he needs to do multiple lookups though! :) — marstran, Aug 09 '16 at 11:04

score 0 · Answer 2 · answered Aug 09 '16 at 10:45

I guess you read the line(s) of the file into a String anyway, so splitting it and storing the substrings in a set isn't more optimal if you plan only one query.

BlackCat

He says: `Note: The purpose of this question is because the frequency of checking for string is very high.` – haylem Aug 09 '16 at 10:47
@haylem Frequency of checking in the same set of words or does he need checking in different sets of words? I thought he meant the latter. – BlackCat Aug 09 '16 at 10:51
the way I understand it means he'll want to check for several strings frequently in one (or more) sets. Does not really matter though. Even if you understood the latter, if he wants to recheck the same sets multiple times, it will still be preferable to build the sets (and possibly cache them). Over time the set approach is more efficient, except if he's got a very narrow use case of one-time lookup for short strings in short sets that do not get reused. – haylem Aug 09 '16 at 10:54
Sadly, the question is really confusing, so it's hard to answer it correctly. Anyway, I like your answer because it's detailed. – BlackCat Aug 09 '16 at 11:05

ArcticLord · Answer 3 · 2016-08-09T11:09:32.617

If you want to know about performance differences, simply measure them. Here is a test setup for you.

import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

final int WORDS = 10000;
final int SEARCHES = 1000000;

Set<String> strSet = new TreeSet<String>();
String strStr = "";
int[] searches = new int[SEARCHES];
Random randomGenerator = new Random();

// filling set and string
for(int i = 0; i < WORDS; i++){
    strSet.add(String.valueOf(i));
    strStr += "," + String.valueOf(i);
}

// creating searches
for(int i = 0; i < SEARCHES; i++)
    searches[i] = randomGenerator.nextInt(WORDS);

// measure set
long startTime = System.currentTimeMillis();
for(int i = 0; i < SEARCHES; i++)
    strSet.contains(String.valueOf(searches[i]));
System.out.println("set result " + (System.currentTimeMillis() - startTime));

// measure string
startTime = System.currentTimeMillis();
for(int i = 0; i < SEARCHES; i++)
    strStr.contains(String.valueOf(searches[i]));
System.out.println("string result " + (System.currentTimeMillis() - startTime));

For me, the output (in milliseconds) is convincing evidence that you should stick with a Set:

set result 350
string result 14197

For a single search, `String.contains` will probably be faster. You won't get the overhead of splitting the string and adding it to a set. — marstran, Aug 09 '16 at 11:05
For a single search its faster to print all words and let my grandma search :-D — ArcticLord, Aug 09 '16 at 11:10

score 0 · Answer 4 · answered Aug 09 '16 at 10:58

A Set should take more memory but less execution time, provided the words are stored without commas (which can be done with a simple split).

But what I really think is the best approach is experimental proof: time both with System.currentTimeMillis().

whyn0t
