2

I have a list of phrases in a database (a phrase may consist of one or more words) and an input string. I need to find out which of those phrases appear in the input string.

Is there an efficient way to perform such matching in Java?

medvaržtis
  • Do you have an example of the phrases or the input string? Many solutions can be considered, using Java or SQL. – VirtualTroll May 17 '11 at 19:45
  • Example phrases could be "Private equity" and "Software". And let's say the input string is "US private equity house is thought to be preparing a bid worth 425-450p a share for the UK software group, which this week revealed it had received an enquiry relating to a possible takeover." For both phrases I need to get a positive answer about their existence in the string. – medvaržtis May 17 '11 at 20:06
  • @medvaržtis: I would probably consider a data structure such as Aho-Corasick or a suffix tree. There is no straightforward solution in Java or in SQL. – VirtualTroll May 17 '11 at 20:29

4 Answers

3

A quick hack would be:

  1. Build a regexp based on the combined phrases
  2. Construct a set listing the phrases that haven't matched so far
  3. Repeatedly run find until all phrases have been found or end of input is reached, removing matches from the set of remaining phrases to find

That way, the input is traversed only once, regardless of how many phrases you provide. If the regexp compiler generates an efficient matcher for multiple alternatives, this should yield decent performance. However, this depends a lot on your phrases and input string, as well as on the quality of the Java regexp engine.

Sample code (tested, but not optimized or profiled for performance):

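import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static boolean hasAllPhrasesInInput(List<String> phrases, String input) {
    // Phrases not yet seen in the input, stored lower-cased for comparison.
    Set<String> phrasesToFind = new HashSet<String>();
    // Build one alternation: \Qphrase1\E|\Qphrase2\E|...
    StringBuilder sb = new StringBuilder();
    for (String phrase : phrases) {
        if (sb.length() > 0) {
            sb.append('|');
        }
        // quote() makes regexp metacharacters in the phrase match literally.
        sb.append(Pattern.quote(phrase));
        phrasesToFind.add(phrase.toLowerCase());
    }
    Pattern pattern = Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(input);
    // A single left-to-right pass: each match removes one phrase from the set.
    while (matcher.find()) {
        phrasesToFind.remove(matcher.group().toLowerCase());
        if (phrasesToFind.isEmpty()) {
            return true; // every phrase has been seen, stop early
        }
    }
    return false;
}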

Some caveats:

  • The code above will match phrases as substrings of words. If only complete words should match, you will need to add word boundaries ("\b") to the generated regexps.
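  • The code must be modified if some phrases may be substrings of other phrases: find() resumes after the end of the previous match, so a phrase that occurs only inside another matched phrase is never reported.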
  • If you need to match non-ASCII text, you should add the regexp option Pattern.UNICODE_CASE and call toLowerCase(Locale) instead of toLowerCase(), using a suitable Locale.
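The first and third caveats can be sketched as follows. This is a minimal variant under stated assumptions, not the code from the answer above: the class name WholeWordPhraseFinder and the method findPhrases are hypothetical, the method returns the set of phrases found rather than a boolean, and Locale.ROOT stands in for whatever locale suits your data. It wraps the alternation in "\b" and adds Pattern.UNICODE_CASE. Sorting longest-first only makes a longer phrase win over a shorter one that is its prefix; a phrase that occurs only inside another matched phrase is still not reported. Phrases that begin or end with a non-word character will not work with "\b".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class WholeWordPhraseFinder {

    /** Returns the lower-cased phrases that occur in the input as whole words. */
    static Set<String> findPhrases(List<String> phrases, String input) {
        Set<String> found = new LinkedHashSet<>();
        if (phrases.isEmpty()) {
            return found; // an empty alternation would match at every position
        }
        // Longest first, so a longer phrase is tried before a shorter prefix of it.
        List<String> sorted = new ArrayList<>(phrases);
        sorted.sort(Comparator.comparingInt(String::length).reversed());

        // \b(?:\Qphrase1\E|\Qphrase2\E|...)\b
        StringBuilder sb = new StringBuilder("\\b(?:");
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(sorted.get(i)));
        }
        sb.append(")\\b");

        Pattern pattern = Pattern.compile(sb.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            // Assumption: Locale.ROOT is acceptable for normalising the match.
            found.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }
}
```

With the phrases "Private equity" and "Software" and the input string from the question, this should return both phrases in lower case, while a phrase such as "soft" should not match inside "software".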
markusk
0

Here is a solution using Java. As you have not specified anything about the strings you use, I will consider a generic example:

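import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("cat");
// Create a matcher with an input string
Matcher m = p.matcher("one cat," + " two cats in the yard");
// find() looks for the pattern anywhere in the input and returns true here;
// matches() would require the whole input to match and would return false.
boolean b = m.find();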

Hope that helps

Reference: http://java.sun.com/developer/technicalArticles/releases/1.4regex/

Shaunak
  • Well, I think it should be m.find() instead of m.matches(). However, I don't consider this, or String.contains(), a suitable solution. I have about 1000 phrases in my database, so I would have to call these methods again for every single phrase. I do not think it's efficient to call String.contains() or Matcher.find() 1000 times. – medvaržtis May 17 '11 at 20:24
  • I don't think you'll have performance problems using String.contains(). Pulling the 1000 matching words out of the database will most likely be slower than looping through them and comparing them to a string. I tried your phrase with 1000 search words and String.contains(), and it took 1 ms. – ScArcher2 May 17 '11 at 21:45
0
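// JDBC placeholders are written as ?. The input is the left-hand operand of LIKE,
// so that rows whose phrase occurs inside the input are returned.
// || is standard SQL concatenation; some databases (e.g. MySQL by default) need CONCAT().
String sql = "SELECT phrase " +
  " FROM phrases " +
  " WHERE ? LIKE '%' || phrase || '%'";
PreparedStatement pstmt = conn.prepareStatement(sql);
// probably repeated, if more than one input:
pstmt.setString(1, input);
ResultSet rs = pstmt.executeQuery();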

A prepared statement is validated against the database once and is faster on repeated invocation, so if you have more than one input, running it in a loop should still be fast.

Of course, you could load all your phrases into RAM, for example into a map. That is slow to prepare, but it might be faster if you have many invocations rather than a single input. Still, databases are often quite efficient at searching.
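A minimal sketch of the in-memory alternative, assuming the phrases have already been read from the database into a List<String>. The class name InMemoryPhraseMatcher and the method phrasesIn are hypothetical, and a plain list is used instead of a map because only iteration is needed here. It checks every cached phrase with String.contains(), so it matches substrings of words and its cost grows with the number of phrases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: caches the phrases once, then answers many inputs.
final class InMemoryPhraseMatcher {
    private final List<String> lowerCasedPhrases = new ArrayList<>();

    /** The phrases would be read from the database once, e.g. at startup. */
    InMemoryPhraseMatcher(List<String> phrasesFromDatabase) {
        for (String phrase : phrasesFromDatabase) {
            lowerCasedPhrases.add(phrase.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns every cached phrase that occurs in the input, ignoring case. */
    List<String> phrasesIn(String input) {
        String haystack = input.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String phrase : lowerCasedPhrases) {
            if (haystack.contains(phrase)) {
                found.add(phrase);
            }
        }
        return found;
    }
}
```

One matcher instance can be reused for every input string, so the database is queried only when the phrase list changes.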

user unknown
0

You can organize the search phrases from your database into a tree based on their common beginnings. Then you can analyze your string character by character, trying to match it against the nodes of that tree.
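One way to read this suggestion is a character trie, sketched below. The class name PhraseTrie and the method phrasesIn are hypothetical, and matching is case-insensitive via Locale.ROOT, which is an assumption. This is a plain trie that restarts the walk at every input position; it is not Aho-Corasick, which would add failure links to avoid the restarts. Unlike a regexp alternation, it also reports phrases that are substrings of other phrases.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the tree described above.
final class PhraseTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String phrase; // non-null if a phrase ends at this node
    }

    private final Node root = new Node();

    /** Inserts every phrase, lower-cased, so common beginnings share nodes. */
    PhraseTrie(List<String> phrases) {
        for (String phrase : phrases) {
            String key = phrase.toLowerCase(Locale.ROOT);
            Node node = root;
            for (int i = 0; i < key.length(); i++) {
                node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
            }
            node.phrase = key;
        }
    }

    /** Returns the phrases that occur anywhere in the input (substring match). */
    Set<String> phrasesIn(String input) {
        String text = input.toLowerCase(Locale.ROOT);
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            // Walk down the trie for as long as the input keeps matching.
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;
                }
                if (node.phrase != null) {
                    found.add(node.phrase);
                }
            }
        }
        return found;
    }
}
```

The inner walk stops as soon as no phrase shares the current prefix, so most start positions cost only a lookup or two.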

Olaf