There are a number of interesting aspects to this question and several ways to approach the problem. Every one of them has trade-offs.
When people go on about HashMaps and such being O(1), they're still missing some of the compile-time optimizations that can be done. Knowing the set of words at compile time allows you to put it into an Enum, which in turn allows you to use the lesser-known EnumMap (doc) and EnumSet (doc). An Enum gives you an ordinal type, which lets you size the backing array or bitfield once and never worry about expanding it. Likewise, EnumMap and EnumSet index directly by that ordinal, so there are no complex hash lookups (especially of non-interned strings). The EnumSet is kind of a type-safe bitfield.
    import java.util.EnumSet;

    public class Main {
        public static void main(String[] args) {
            // Reused for every argument; backed by a bitfield sized from the enum.
            EnumSet<Words> s = EnumSet.noneOf(Words.class);
            for (String a : args) {
                s.clear();
                for (String w : a.split("\\s+")) {
                    try {
                        s.add(Words.valueOf(w.toUpperCase()));
                    } catch (IllegalArgumentException e) {
                        // Not a keyword: valueOf throws, and the word is skipped.
                    }
                }
                System.out.print(a);
                if (s.size() == Words.values().length) { System.out.println(": All!"); }
                else { System.out.println(": Only " + s.size()); }
            }
        }

        enum Words {
            STACK,
            SOUP,
            EXCHANGE,
            OVERFLOW
        }
    }
When run with some example strings on the command line:
"stack exchange overflow soup foo"
"stack overflow"
"stack exchange blah"
One gets the results:
stack exchange overflow soup foo: All!
stack overflow: Only 2
stack exchange blah: Only 2
You've moved the matching into the core language, hoping it's well optimized. It turns out that Enum.valueOf is ultimately just a lookup in a Map<String,T> (and, digging even further, it's a HashMap hidden deep within the Class class).
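Since valueOf ends up as a map lookup anyway, one option is to build that map yourself and skip the exception on every miss. The following is only a sketch under that assumption: the class name KeywordLookup, the BY_NAME map, and the keywordsIn helper are made up for illustration and are not part of the programs in this answer.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class KeywordLookup {
    enum Words { STACK, SOUP, EXCHANGE, OVERFLOW }

    // Hypothetical helper map, built once: lower-case keyword text -> enum constant.
    private static final Map<String, Words> BY_NAME = new HashMap<>();
    static {
        for (Words w : Words.values()) {
            BY_NAME.put(w.name().toLowerCase(Locale.ROOT), w);
        }
    }

    /** Returns the distinct keywords found in whitespace-separated text. */
    static EnumSet<Words> keywordsIn(String text) {
        EnumSet<Words> found = EnumSet.noneOf(Words.class);
        for (String token : text.split("\\s+")) {
            Words w = BY_NAME.get(token.toLowerCase(Locale.ROOT));
            if (w != null) { // a non-keyword is just a map miss, no exception
                found.add(w);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (String a : args) {
            System.out.println(a + ": " + keywordsIn(a));
        }
    }
}
```

The result is still an EnumSet, so the "did I see all of them" check stays a cheap size comparison against Words.values().length.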
You've got a String. Splitting it into tokens of some sort is unavoidable. Each token needs to be examined to see if it matches. But comparing each token against every keyword is, as you've noted, expensive.
However, the language of "matches exactly these strings" is a regular one. This means we can use a regular expression to filter out the words that are not going to match. A regular expression can be matched in O(n) time (see What is the complexity of regular expression?). Note that java.util.regex is a backtracking engine, so that bound is not guaranteed for arbitrary patterns, but a simple alternation like the one used here does behave linearly.
This doesn't get rid of O(wordsInString * keyWords), because that is still the worst case, but it does mean that an unmatched word costs only O(charsInWord) to eliminate.
    package com.michaelt.so.keywords;

    import java.util.EnumSet;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        // Compiled once; accepts exactly the four keywords, in any letter case.
        final static Pattern pat = Pattern.compile("S(?:TACK|OUP)|EXCHANGE|OVERFLOW", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) {
            EnumSet<Words> s = EnumSet.noneOf(Words.class);
            Matcher m = pat.matcher(""); // reused for every word via reset()
            for (String a : args) {
                s.clear();
                for (String w : a.split("\\s+")) {
                    m.reset(w);
                    if (m.matches()) {
                        try {
                            s.add(Words.valueOf(w.toUpperCase()));
                        } catch (IllegalArgumentException e) {
                            // Not expected here: the regex only lets keywords through.
                        }
                    } else {
                        System.out.println("No need to look at " + w);
                    }
                }
                System.out.print(a);
                if (s.size() == Words.values().length) { System.out.println(": All!"); }
                else { System.out.println(": Only " + s.size()); }
                System.out.println();
            }
        }

        enum Words {
            STACK,
            SOUP,
            EXCHANGE,
            OVERFLOW
        }
    }
And this gives the output of:
No need to look at foo
stack exchange overflow soup foo: All!
stack overflow: Only 2
No need to look at blah
stack exchange blah: Only 2
Now, the big letdown. Despite all of this, it is probably still faster for Java to compute the hash of the string and look it up in a hash table (a HashSet or HashMap) to see if it exists or not.
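For comparison, the plain hash-table version might look roughly like this. It is a minimal sketch, not a benchmark: the class name HashSetKeywords and its two helper methods are invented for illustration, and whether it actually beats the enum or regex variants would need measuring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class HashSetKeywords {
    // The fixed keyword list, stored in lower case.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("stack", "soup", "exchange", "overflow"));

    /** Counts how many distinct keywords occur in whitespace-separated text. */
    static int distinctKeywords(String text) {
        Set<String> seen = new HashSet<>();
        for (String token : text.split("\\s+")) {
            String t = token.toLowerCase(Locale.ROOT);
            if (KEYWORDS.contains(t)) { // one hash computation and lookup per token
                seen.add(t);
            }
        }
        return seen.size();
    }

    /** True when every keyword appears at least once. */
    static boolean containsAll(String text) {
        return distinctKeywords(text) == KEYWORDS.size();
    }

    public static void main(String[] args) {
        for (String a : args) {
            System.out.println(a + (containsAll(a) ? ": All!" : ": Only " + distinctKeywords(a)));
        }
    }
}
```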
The only thing here that would be better would be a single regex that matches the whole input string when it contains all the keywords, in any order. As mentioned, it is a regular language.
(?:stack\b.+?\bexchange\b.+?\bsoup\b.+?\boverflow)|(?:soup\b.+?\bexchange\b.+?\bstack\b.+?\boverflow) ...
The above regex will match the string stack exchange pea soup overflow.
There are four words here, which means 4! = 24 alternatives of the form (s1)|(s2)|(s3)|...|(s24).
A regex with 10 keywords approached this way would be (s1)|...|(s3628800), which is very impractical. It is possible, though some engines might choke on a regex that large. Still, with an engine that matches in linear time, it would trim the work down to O(n), where n is the length of the string you've got.
Further note that this is an all filter rather than an any filter or a some filter.
If you want to match one keyword out of ten, then the regex is only ten groups long. If you want to match two keywords out of ten, then it's only 90 groups long (10 * 9 ordered pairs; a bit long, but the engine might not choke on it). This regex can be programmatically generated.
This will get you back down to O(N) time where N is the length of the tweet. No splitting required.
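Generating such a regex could look something like the sketch below. The class name KeywordRegex and its methods are hypothetical, and the outer \b anchors and Pattern.quote calls are my additions rather than part of the hand-written regex above. It builds one alternative per ordered selection of k distinct keywords, so k equal to the number of keywords gives the "all" filter and k = 2 gives the two-out-of-ten case. Keep in mind that java.util.regex backtracks, so the linear-time claim is not guaranteed with this engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class KeywordRegex {
    /** One alternative per ordered selection of k distinct keywords. */
    static List<String> alternatives(List<String> keywords, int k) {
        List<String> out = new ArrayList<>();
        build(keywords, k, new boolean[keywords.size()], new ArrayList<>(), out);
        return out;
    }

    // Depth-first enumeration of k-permutations; 'used' marks keywords already chosen.
    private static void build(List<String> keywords, int k, boolean[] used,
                              List<String> chosen, List<String> out) {
        if (chosen.size() == k) {
            // Same shape as the hand-written regex: kw1\b.+?\bkw2\b.+?\bkw3 ...
            out.add("(?:" + String.join("\\b.+?\\b", chosen) + ")");
            return;
        }
        for (int i = 0; i < keywords.size(); i++) {
            if (used[i]) continue;
            used[i] = true;
            chosen.add(Pattern.quote(keywords.get(i)));
            build(keywords, k, used, chosen, out);
            chosen.remove(chosen.size() - 1);
            used[i] = false;
        }
    }

    /** Joins the alternatives with | and anchors the whole thing on word boundaries. */
    static Pattern compile(List<String> keywords, int k) {
        String body = String.join("|", alternatives(keywords, k));
        return Pattern.compile("\\b(?:" + body + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("stack", "exchange", "soup", "overflow");
        Pattern all = compile(words, words.size());
        for (String a : args) {
            System.out.println(a + ": " + all.matcher(a).find());
        }
    }
}
```

Use find() rather than matches() on the resulting Pattern, since the keywords may be surrounded by other text.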