I want to make a messenger application and I want to filter an incoming String based on certain keywords. The language I am planning to use is Java but I can use Groovy too.

The keyword list will be static and stored in a file (plain text or CSV).

The keyword list size will be 100 words max (no way I will use more than 100 keywords).

The incoming string will be at most 200 bytes (UTF-8).

I have seen quite a few posts saying that using keywords to filter a string is obsolete. The application I am planning to build will be simple, so I don't want to mess with NLP.

Keywords may be regexes or normal words.

I know there are plenty of ways to do this, but I want the fastest one. I have read that a good approach is to use a HashMap, but I don't see how this could be fast when combined with regex.

For example, an incoming string can be:

String example = "I want to gamble and drink vodka all day";

A keyword list will contain:

DRUGS
VODKA.?
GAMBLE

The example String should be filtered because it contains at least one word from the keyword list.

EDIT*

After some replies pointing out that using regex is slow, I want to find a good solution without regex.

Without regex, one way to do it is to put the keywords in a set, split the incoming string into an array, then iterate over the array and check whether any of its words are contained in the set.

This will not work in some cases. For example someone can enter "I like to gambleand drinkvodka all day". This will not match.
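
The set-based approach and its blind spot can be sketched as follows. This is only an illustration: the class name `SetFilter`, the method `containsKeyword`, and the lower-case keyword set are assumptions made for the example, not part of any library.

```java
import java.util.Locale;
import java.util.Set;

public class SetFilter {
    // Hypothetical keyword set, kept lower-case so the input is lower-cased once.
    static final Set<String> KEYWORDS = Set.of("drugs", "vodka", "gamble");

    // Returns true if any whitespace-separated token equals a keyword exactly.
    static boolean containsKeyword(String input) {
        for (String token : input.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (KEYWORDS.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Separate words are found.
        System.out.println(containsKeyword("I want to gamble and drink vodka all day")); // true
        // Glued words are missed, because "gambleand" is not in the set.
        System.out.println(containsKeyword("I like to gambleand drinkvodka all day"));   // false
    }
}
```

Each lookup is a constant-time hash probe, but a token only matches when it equals a keyword exactly, which is why the glued input slips through.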

That is one of the reasons I see regex as the only way to go with word filtering...

marc_s
  • Fastest algorithm could be some mix of techniques based on frequency analysis and also separating regex patterns from non-regex patterns, that's why I think that question is too broad – Michal Kordas May 22 '19 at 14:38
  • Also, you won't get maximum performance if you allow any regex. Algorithm could be faster if you introduce more categories - disallowed word prefixes, disallowed word suffixes, word does not contain substring etc. – Michal Kordas May 22 '19 at 14:40
  • I want to use regex because someone can enter a word like v0dka. But maybe I can normalize the incoming string first, e.g. replace all 0 with o, and all ! and 1 with i. That would be good. –  May 22 '19 at 14:48
  • Java regular expressions perform really badly. A pattern of the form `(first|second|...|last)` uses backtracking and almost no optimization. Better to use a solution with DFA regular expressions (brics or rexlex). Even better (but not as flexible) would be multi-string search, as provided in [stringsearchalgorithms](http://stringsearchalgorithms.amygdalum.net/). – CoronA May 22 '19 at 14:57
  • Yes, if you aim for really fastest that may be the way to go, but I'd go for simplicity if only possible. – Michal Kordas May 22 '19 at 14:58
  • So without using regex, the best way to do this is to put the keywords in a set, split the incoming string into an array, then iterate over the array and check whether any of its words are contained in the set? This will not work in some cases. For example, someone can enter "I like to gambleand drinkvodka all day". This will not match. That's why I see regex as the only way to go with word filtering. –  May 22 '19 at 15:00
  • There are even faster solutions with cleverly organized tries. Yet if the words are not separated by whitespace, or if you want to introduce fuzziness, try out [brics](https://www.brics.dk/automaton/) (faster) or [rexlex](https://github.com/almondtools/rexlex) (more flexible). Neither engine suffers from backtracking issues. – CoronA May 22 '19 at 15:04
  • Using an array is slow because you are copying a bunch of strings. This is not a simple problem you are asking: Finding a single substring is an area with tons of research and effort with lots of tradeoffs: https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm Therefore, you need to decide which tradeoffs do you want: Do you want flexible words (replacing `o` with `0`)? How long are the messages? How many messages are there? How similar are the words? How long are the words? – Nathan Merrill May 22 '19 at 15:07
  • I will check your suggestions and come back with a reply later (I hope). Thanks for your replies. I understand that this is not an easy question to answer; that's why I am asking it here, because I couldn't figure it out on my own. –  May 22 '19 at 15:07
  • Is it true that your "words" only contain wildcards (as in `VODKA.?`) and not generic regular expressions? In that case a multi-string search with wildcards would be the fastest (sublinear performance), yet I do not know of a library implementing such an algorithm. If the possible characters are limited, there is a simple and performant way to adapt a simple multi-string search algorithm. – CoronA May 22 '19 at 15:09
  • Oh, another question is whether or not you care about startup time. Do you want a single process that takes 100ms to build a data structure, but responds in 1ms, or a process that starts up and responds in 5ms? – Nathan Merrill May 22 '19 at 15:11
  • I don't know what my "words" will be yet. In my question I use VODKA.? because I want to catch a string like "i likevodka". –  May 22 '19 at 15:14
  • Why would building a data structure with at most 100 entries take 100 ms? I don't really care about startup time; I care more about response time. –  May 22 '19 at 15:15
  • `VODKA.?` does not match `i likevodka`? In this case a substring search would be sufficient and efficient (however java multi-substring search is also slow). Matching even `v0dka` will require wildcard search or regex search. @Nathan: Valid point: Startup time for the 'fast algorithms' (regex search, multi-string-search) is often very long, requiring 1000 searches and more to get efficient overall. – CoronA May 22 '19 at 15:25
  • So to conclude what should I try to implement? –  May 22 '19 at 15:45
  • @panospap there are many options presented here. Speed is not a science, as it depends on many variables: if you care a lot about it, you try each of them and profile. – Nathan Merrill May 22 '19 at 16:18
  • The answers use regex. I don't see a clear answer on what I should use without regex. –  May 22 '19 at 16:32
  • CoronA mentioned more: brics, rexlex. You can also try generating a trie, or modifying any of the algorithms mentioned in the stackoverflow post I linked to handle multiple substrings. – Nathan Merrill May 22 '19 at 16:50
  • One of the reasons keyword search is obsolete is that users can sneak in their words in many ways. Put spaces between letters - v o d k a, or zero-width-spaces. Use at least one letter that looks the same but is a different Unicode code. E.g. use cyrillic letters that look the same as Latin ones. Ask yourself if it's even worth it. – RealSkeptic May 22 '19 at 18:36
  • @RealSkeptic so what do you suggest? How can an efficient keyword filter be made? –  May 24 '19 at 07:22
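
The normalisation idea raised in the comments (replace 0 with o, and ! and 1 with i, then search without regex) might look roughly like the sketch below. The names `NormalizingFilter`, `normalize` and `containsKeyword` and the keyword list are assumptions for illustration. It uses plain `String.contains` per keyword, which is the simplest substring check, not the fastest one.

```java
import java.util.List;
import java.util.Locale;

public class NormalizingFilter {
    // Hypothetical keyword list, lower-case.
    static final List<String> KEYWORDS = List.of("drugs", "vodka", "gamble");

    // Lower-cases the input and maps the look-alike characters named in the comments.
    static String normalize(String input) {
        return input.toLowerCase(Locale.ROOT)
                .replace('0', 'o')
                .replace('1', 'i')
                .replace('!', 'i');
    }

    // Substring check, so glued words such as "likevodka" are found as well.
    static boolean containsKeyword(String input) {
        String normalized = normalize(input);
        for (String keyword : KEYWORDS) {
            if (normalized.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsKeyword("I like v0dka"));  // true
        System.out.println(containsKeyword("i likevodka"));   // true
        System.out.println(containsKeyword("hello world"));   // false
    }
}
```

Note the side effect: every `!` and `1` in the message is rewritten, so ordinary punctuation and digits change too. As RealSkeptic points out, this mapping does not cover inserted spaces or look-alike Unicode letters.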

3 Answers

1

As long as you can afford some time for preprocessing, the following approaches are efficient:

Multi-String-Search

A search for multiple strings (needles) processes the input (haystack) char by char and skips sections that can never be matched by any of the specified words. It is not limited to word boundaries and often runs in sublinear time with respect to the length of the haystack.

The most popular algorithm is Aho-Corasick. You can find a couple of well-tested algorithms in [stringsearchalgorithms](http://stringsearchalgorithms.amygdalum.net/).
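
To make the idea concrete, here is a minimal hand-written sketch of Aho-Corasick. It is not the implementation from the library mentioned above, and the class name `AhoCorasick` and method `containsAny` are assumptions for this example. It only answers "does the text contain any keyword?" and expects the caller to normalise case beforehand.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;         // node for the longest proper suffix that is also a trie prefix
        boolean terminal;  // a keyword ends here, directly or via the failure chain
    }

    private final Node root = new Node();

    public AhoCorasick(List<String> keywords) {
        // Phase 1: insert every non-empty keyword into a trie.
        for (String keyword : keywords) {
            if (keyword.isEmpty()) continue;
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit matches from the suffix, e.g. "bc" inside "abc".
                child.terminal |= child.fail.terminal;
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns true as soon as any keyword ends.
    public boolean containsAny(String text) {
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node target = node.next.get(c);
            node = (target != null) ? target : root;
            if (node.terminal) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasick filter = new AhoCorasick(List.of("drugs", "vodka", "gamble"));
        System.out.println(filter.containsAny("i like to gambleand drinkvodka all day")); // true
        System.out.println(filter.containsAny("hello world"));                            // false
    }
}
```

The automaton is built once at startup and can then be reused for every message. This variant reads each input character exactly once, so it is linear in the haystack length; the skipping behaviour described above belongs to other multi-string algorithms.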

DFA-Regular-Expression-Search

A search with a regular-expression DFA (deterministic finite automaton) engine processes the input (haystack) char by char and updates the engine's automaton. It never skips sections, so it can never run in less than linear time.

The main advantage of regular expression search is that you can easily specify patterns instead of words. The main disadvantage is preprocessing time (which is exponential in the pattern length in the worst case). Some time ago I spent many minutes or even hours waiting for a complex regex to compile.

You may find regex search at patternsearchalgorithms or [brics](https://www.brics.dk/automaton/).

CoronA
  • Wow, that's awesome that there's a library that implements a bunch of algorithms so you can try them all. – Nathan Merrill May 22 '19 at 16:52
  • This is the answer I was expecting; the GitHub project with all those benchmarks is amazing. I am marking the answer as accepted and I'll provide the solution that fits my needs later. –  May 23 '19 at 07:36
  • I will implement my word filter without using regex. Let me know if I am right. I will be using lower-case English letters. For a keyword list of at most 100 entries (each keyword at most 12 characters) and an input String of at most 200 characters, is Aho-Corasick the best way to check whether this String contains at least one of the keywords? –  May 23 '19 at 08:30
  • Yes, Aho-Corasick performs well with small alphabets and small patterns. – CoronA May 23 '19 at 13:46
0

Try a regular expression for exact word matches:

import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SoRegex {
    // The static set of keywords.
    static final Set<String> keywords = Set.of("DRUGS", "VODKA", "GAMBLE");

    public static void main(String[] args) {
        // Construct a regular expression that matches any of the keywords anywhere. Use
        // word boundaries '\b'.
        StringBuilder sb = new StringBuilder("^.*(\\b").append(String.join("\\b|\\b", keywords)).append("\\b).*$");
        Pattern p = Pattern.compile(sb.toString());

        String input = "I want to gamble and drink vodka all day";

        // Convert the input to uppercase since the keywords are uppercase.
        Matcher matcher = p.matcher(input.toUpperCase());
        System.out.println(String.format("input '%s' matches pattern '%s': %b", input, p, matcher.matches()));
    }

}

Output (the order of the alternatives may vary, because Set.of has no defined iteration order):

input 'I want to gamble and drink vodka all day' matches pattern '^.*(\bGAMBLE\b|\bDRUGS\b|\bVODKA\b).*$': true

Other types of keywords are left as an exercise to the reader.

  • Checking simple words using regex most probably won't be the fastest solution for the problem – Michal Kordas May 22 '19 at 14:49
  • Maybe. Alternative proposals? –  May 22 '19 at 14:50
  • This is a good solution, but it only uses regex, so it's not that fast. Like @michal said, it should be a mix, but I think this is really hard, so I might only go with normal words and not regex. –  May 22 '19 at 14:52
0

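One solution (certainly not the fastest, but maybe good enough) would be to treat every entry in the list as a regex and join all of them with | so that a single find() on the matcher is enough.

// Assumes java.util.regex.Pattern and java.util.regex.Matcher are imported.
String input = "I want to gamble and drink vodka all day";
// CASE_INSENSITIVE lets the upper-case keywords match lower-case input.
Pattern pattern = Pattern.compile("DRUGS|VODKA.?|GAMBLE", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(input);
boolean result = matcher.find(); // true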
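
Since the keyword list is static, the joined pattern can be built from the list and compiled only once. The following is a sketch under that assumption. The class name `RegexFilter`, the field names and the hard-coded entries are hypothetical; in practice the entries would come from the keyword file.

```java
import java.util.List;
import java.util.regex.Pattern;

public class RegexFilter {
    // Hypothetical entries; each one is treated as a regex, as described above.
    static final List<String> ENTRIES = List.of("DRUGS", "VODKA.?", "GAMBLE");

    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    static final Pattern FILTER = Pattern.compile(
            String.join("|", ENTRIES),
            Pattern.CASE_INSENSITIVE);

    // find() looks for a match anywhere in the input, so glued words are caught too.
    static boolean isFiltered(String input) {
        return FILTER.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isFiltered("I want to gamble and drink vodka all day")); // true
        System.out.println(isFiltered("I like to gambleand drinkvodka all day"));   // true
        System.out.println(isFiltered("hello world"));                              // false
    }
}
```

Compiling per message would repeat the preprocessing cost on every call, so keeping the `Pattern` in a static field matters more for response time than the choice between `find()` and `matches()`.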
Michal Kordas