
I'm trying to filter tweets against a keyword filter. The filter could have 10 words or more, and a tweet passes if it contains the keywords. The only approach I can think of is to split the tweet's text into tokens, then loop over the filter words and compare every token to every word in the filter. However, this seems very slow. If the keyword filter has N keywords and the tweet has M tokens, it needs O(N*M) comparisons.

Is there a better approach?

Jack Twain
  • See if the Java object you store the tweet in implements something like a `contains()` method. For example, `String` has one: http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#contains(java.lang.CharSequence) – PM 77-1 Jan 18 '14 at 17:27
  • well considering that a tweet is limited to 140 (?, not sure) characters, I would argue that this is fast enough. remember: first nail it, then scale it :) – kmera Jan 18 '14 at 17:27
  • it's just O(N) (# of words in your text) with a `Map` (see Hashtable in http://bigocheatsheet.com/) – zapl Jan 18 '14 at 17:29
  • Alternatively, you can dynamically build and compile a `Pattern` (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and then loop and match through the tweets. – PM 77-1 Jan 18 '14 at 17:31
  • Are the keywords determined at compile time or at run time? – MichaelT Jan 18 '14 at 17:34
  • @MichaelT the filtering keywords are determined at compile time – Jack Twain Jan 18 '14 at 21:18

6 Answers


There are a number of interesting aspects to this question and several ways to approach the problem. Each of them has trade-offs.


When people go on about HashMaps and such being O(1), they're still missing some of the compile-time optimizations that can be done. Knowing the set of words at compile time lets you put it into an enum, which in turn lets you use the lesser-known EnumMap (http://docs.oracle.com/javase/7/docs/api/java/util/EnumMap.html) and EnumSet (http://docs.oracle.com/javase/7/docs/api/java/util/EnumSet.html). An enum gives you an ordinal type, which allows you to size the backing array or bit field once and never worry about expanding it. Likewise, the hash of an enum is its ordinal value, so you don't have complex hash lookups (especially of non-interned strings). An EnumSet is essentially a type-safe bit field.

import java.util.EnumSet;

public class Main {
    public static void main(String[] args) {
        EnumSet<Words> s = EnumSet.noneOf(Words.class);

        // Each command line argument is treated as one tweet.
        for(String a : args) {
            s.clear();
            for(String w : a.split("\\s+")) {
                try {
                    // valueOf() does the keyword lookup for us.
                    s.add(Words.valueOf(w.toUpperCase()));
                } catch (IllegalArgumentException e) {
                    // the token is not one of the keywords; ignore it
                }
            }
            System.out.print(a);
            if(s.size() == 4) { System.out.println(": All!"); }
            else { System.out.println(": Only " + s.size()); }
        }
    }

    enum Words {
        STACK,
        SOUP,
        EXCHANGE,
        OVERFLOW
    }
}

When run with some example strings on the command line:

"stack exchange overflow soup foo"
"stack overflow"
"stack exchange blah"

One gets the results:

stack exchange overflow soup foo: All!
stack overflow: Only 2
stack exchange blah: Only 2

You've moved what one matches into the core language, hoping it's well optimized. It turns out this looks like it's ultimately just a Map<String,T> (and, digging even further, a HashMap hidden deep within the Class class).


You've got a String. Splitting it into tokens of some sort is unavoidable. Each token needs to be examined to see if it matches, but comparing each token against all the keywords is, as you've noted, expensive.

However, the language of "matches exactly these strings" is a regular one. This means we can use a regular expression to filter out the words that are not going to match. A regular expression runs in O(n) time (see What is the complexity of regular expression?).

This doesn't get rid of the O(wordsInString * keyWords) worst case (which is what O() describes), but it does mean that for each unmatched word you've only spent O(charsInWord) eliminating it.

package com.michaelt.so.keywords;

import java.util.EnumSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // A pre-filter that can only match the four keywords.
    final static Pattern pat = Pattern.compile("S(?:TACK|OUP)|EXCHANGE|OVERFLOW", Pattern.CASE_INSENSITIVE);
    public static void main(String[] args) {
        EnumSet<Words> s = EnumSet.noneOf(Words.class);
        Matcher m = pat.matcher("");  // one Matcher, reset for each token
        for(String a : args) {
            s.clear();
            for(String w : a.split("\\s+")) {
                m.reset(w);
                if(m.matches()) {
                    try {
                        s.add(Words.valueOf(w.toUpperCase()));
                    } catch (IllegalArgumentException e) {
                        // can't happen here: the regex only matches the four keywords
                    }
                } else {
                    System.out.println("No need to look at " + w);
                }
            }
            System.out.print(a);
            if(s.size() == 4) { System.out.println(": All!"); }
            else { System.out.println(": Only " + s.size()); }
            System.out.println();
        }
    }

    enum Words {
        STACK,
        SOUP,
        EXCHANGE,
        OVERFLOW
    }
}

And this gives the output of:

No need to look at foo
stack exchange overflow soup foo: All!

stack overflow: Only 2

No need to look at blah
stack exchange blah: Only 2

Now, the big letdown: despite all of this, it is probably still faster for Java to compute the hash of the string and look it up in a HashSet to see whether it exists or not.

The only thing that would be better here would be a single regex that matches all of the strings at once. As mentioned, this is a regular language.

(?:stack\b.+?\bexchange\b.+?\bsoup\b.+?\boverflow)|(?:soup\b.+?\bexchange\b.+?\bstack\b.+?\boverflow) ...

The above regex will match the string stack exchange pea soup overflow.

There are four words here, which means 4! parts: (s1)|(s2)|(s3)|...|(s24). A regex with 10 keywords approached this way would be (s1)|...|(s3628800), which could fairly be considered impractical (and some engines might choke on a regex that large). Still, it would trim the check down to O(n), where n is the length of the string you've got.

Further note that this is an all filter rather than an any filter or a some filter.

If you want to match one keyword out of ten, then the regex is only ten groups long. If you want to match two keywords out of ten, then it's only 90 groups long (a bit long, but the engine probably won't choke on it). This regex can be programmatically generated, as sketched below.

This will get you back down to O(N) time where N is the length of the tweet. No splitting required.
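
A minimal sketch of generating such an any-of pattern at run time (the class and method names are mine; Pattern.quote() guards against keywords containing regex metacharacters):

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class AnyKeywordFilter {
    // Builds a single alternation such as "\b(?:stack|soup|exchange|overflow)\b"
    // from an arbitrary keyword list.
    static Pattern anyOf(List<String> keywords) {
        StringBuilder sb = new StringBuilder("\\b(?:");
        for (int i = 0; i < keywords.size(); i++) {
            if (i > 0) { sb.append('|'); }
            sb.append(Pattern.quote(keywords.get(i)));
        }
        sb.append(")\\b");
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = anyOf(Arrays.asList("stack", "soup", "exchange", "overflow"));
        System.out.println(p.matcher("pea soup recipe").find());   // true
        System.out.println(p.matcher("nothing relevant").find());  // false
    }
}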

Community

One way I'm thinking of solving this is to create a HashSet and put all of the tweet's tokens inside it. Then I would loop over the words in the keyword filter and check that they are all in the HashSet.
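
A minimal sketch of that idea, assuming whitespace tokenization (the class and method names are mine):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenSetFilter {
    // A tweet passes if every keyword appears among its tokens.
    // Building the set is O(M) and each keyword lookup is O(1) on
    // average, so the whole check is O(M + N) rather than O(M * N).
    static boolean containsAll(String tweet, List<String> keywords) {
        Set<String> tokens = new HashSet<>(
                Arrays.asList(tweet.toLowerCase().split("\\s+")));
        for (String keyword : keywords) {
            if (!tokens.contains(keyword)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("stack", "overflow");
        System.out.println(containsAll("stack exchange overflow soup", keywords)); // true
        System.out.println(containsAll("stack exchange blah", keywords));          // false
    }
}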

Jack Twain

If you have enough time for preprocessing, you could build up an index: a list (in some easy-to-search data structure, like a tree or a hash table) of all the words contained in all tweets, where each word has attached the ids of the tweets that contain it.

Then you can lookup the keywords in the index and compute the intersection of the IDs.

This technique is known as an inverted index.
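
A minimal sketch of such an index (the class and method names are mine):

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {
    // word -> ids of the tweets that contain that word
    private final Map<String, Set<Integer>> index = new HashMap<>();

    void add(int tweetId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            Set<Integer> ids = index.get(word);
            if (ids == null) {
                ids = new HashSet<>();
                index.put(word, ids);
            }
            ids.add(tweetId);
        }
    }

    // ids of the tweets that contain *all* keywords: intersect the id sets
    Set<Integer> query(List<String> keywords) {
        Set<Integer> result = null;
        for (String keyword : keywords) {
            Set<Integer> ids = index.get(keyword);
            if (ids == null) {
                return Collections.emptySet();  // keyword appears nowhere
            }
            if (result == null) {
                result = new HashSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "stack exchange overflow soup");
        idx.add(2, "stack overflow");
        idx.add(3, "pea soup");
        System.out.println(idx.query(Arrays.asList("stack", "overflow"))); // ids 1 and 2
    }
}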

Henry

Searching in a HashMap is more or less O(1), so if you store the keywords in a HashMap (for example) you will only need to check M times, so it will be O(M).
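
For illustration, a sketch of that direction (the keyword list and names are mine; it's the reverse of putting the tweet's tokens in a set): build the keyword set once up front, then each tweet costs one average-constant-time lookup per token.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class KeywordSetFilter {
    // Built once, reused for every tweet.
    private static final Set<String> KEYWORDS = new HashSet<>(
            Arrays.asList("stack", "soup", "exchange", "overflow"));

    // O(M) per tweet: one O(1) lookup per token.
    static int countHits(String tweet) {
        int hits = 0;
        for (String token : tweet.toLowerCase().split("\\s+")) {
            if (KEYWORDS.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countHits("stack exchange overflow soup foo")); // 4
    }
}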

Raul Guiu
  • wouldn't a LinkedHashSet also act as a hash table? i.e. if I call `contains()`, wouldn't that be O(1)? – Jack Twain Jan 18 '14 at 17:29
  • @AlexTwain yes, `(Linked)HashSet` is realized using a `HashMap`'s `.keySet()` – zapl Jan 18 '14 at 17:32
  • I don't see any benefit in using LinkedHashSet over HashSet; you are not interested in the order. Anyway, any hash-based data structure will work for you. – Raul Guiu Jan 18 '14 at 17:35

I think you can do it with a HashSet in O(M+N), but if you need to save some space you can also try a Bloom filter, which gives false positives with low probability.
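
For illustration, a sketch using Guava's BloomFilter (the choice of Guava is mine; the answer names no particular library):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

public class BloomExample {
    public static void main(String[] args) {
        // Sized for ~1,000,000 entries at a ~1% false-positive rate.
        BloomFilter<CharSequence> words = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000, 0.01);

        words.put("stack");
        words.put("overflow");

        // false means "definitely not present"; true means "probably present".
        System.out.println(words.mightContain("stack")); // true
        System.out.println(words.mightContain("blah"));  // false (with high probability)
    }
}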

Vikram Bhat

It depends:

  • Is it real-time filtering?
  • Are you going to re-run the filtering with a different set of words?

If it's real time, it also depends on the number of words. You can use the contains method or build a regex and hope that it will be fast.

If it's offline work and you are not going to change the set of words, you can use the same methods as in the real-time case. If you think you are going to change the filter, then you will want to build the following index:

For each word, keep a hash whose keys are tweet ids (the value is a bit). To find all the tweets containing the filter words, go over the words and intersect the tweet ids for each word.

Mzf