2

I have a group of words:

"dog", "car", "house", "work", "cat"

I need to be able to match at least 3 of them in a text, for example:

"I always let my cat and dog at the animal nursery when I go to work by car"

Here I want to match the regex because it matches at least 3 words (4 words here):

"cat", "dog", "car" and "work"

EDIT 1

I want to use it with Oracle's regexp_like function

EDIT 2

I also need it to work with consecutive words

  • 1
    I want to use it with oracle's regexp_like function – Francis Eisenhower Jun 02 '18 at 22:15
  • Please clarify if “dog dog dog” should be a match (all answers so far would match this) – Bohemian Jun 02 '18 at 23:09
  • Must the solution use `regexp_like`? – Bohemian Jun 02 '18 at 23:10
  • Or any oracle regex function that gets the job done – Francis Eisenhower Jun 02 '18 at 23:28
  • I personally favor that every word only gets matched once – Francis Eisenhower Jun 02 '18 at 23:29
  • then “dog dog dog” should be a match (favor != require) – Bohemian Jun 03 '18 at 01:10
  • 1
    Do you need exact word matches? What if your input string contains the word 'dogs' - does that match 'dog'? How about capitalization: 'Work is great......' - does 'Work' match 'work'? How about compound words - does 'dogwood' (a kind of tree) match 'dog'? How about 'doghouse' - does that count twice (it matches both 'dog' and 'house')? Did you even think about any of these questions? If you didn't, don't you need to - before you think about any possible solution? –  Jun 03 '18 at 03:20
  • @mathguy You are absulutely right about this, for now I need an exact match – Francis Eisenhower Jun 03 '18 at 08:07
  • OK - you still need to define "exact match". For example, in the answer you marked Correct, a search word will not be considered an exact match if it is immediately followed by a comma, period or question mark (if it is at the end of a sentence) - is that OK? Probably not. You need to give a complete definition of "exact match", then test the solutions to make sure they match that definition in all cases. –  Jun 03 '18 at 12:36
  • Yes, you're right in that this question does need to be much better defined for the best solution to be found. What characters could appear in the input, when does a word count as a match and when does it not, etc. – Callum Watkins Jun 03 '18 at 14:56

5 Answers5

3

Since Oracle's regexp_like doesn't support non-capturing groups and word boundaries, the following expression can be used:

^((.*? )?(dog|car|house|work|cat)( |$)){3}.*$

Try it out here.

Alternatively, a larger but arguably cleaner solution is:

^(.*? )?(dog|car|house|work|cat) .*?(dog|car|house|work|cat) .*?(dog|car|house|work|cat)( .*)?$

Try it out here.

NOTE: These will both match the same word used multiple times, e.g. "dog dog dog".

EDIT: To address the concerns over punctuation, a small modification can be made. It isn't perfect, but should match 99% of situations involving punctuation (but won't match e.g. !dog):

^((.*? )?(dog|car|house|work|cat)([ ,.!?]|$)){3}.*$

Try it out here

Callum Watkins
  • 2,844
  • 4
  • 29
  • 49
2

This is a solution that doesn't use regular expressions, will exclude repeated words and the words to match can be passed in as a bind parameter in a collection:

SQL Fiddle

Oracle 11g R2 Schema Setup:

Create a collection type to store a list of words:

CREATE TYPE StringList IS TABLE OF VARCHAR2(50)
/

Create a PL/SQL function to split a delimited string into the collection:

CREATE OR REPLACE FUNCTION split_String(
  i_str    IN  VARCHAR2,
  i_delim  IN  VARCHAR2 DEFAULT ','
) RETURN StringList DETERMINISTIC
AS
  p_result       StringList := StringList();
  p_start        NUMBER(5) := 1;
  p_end          NUMBER(5);
  c_len CONSTANT NUMBER(5) := LENGTH( i_str );
  c_ld  CONSTANT NUMBER(5) := LENGTH( i_delim );
BEGIN
  IF c_len > 0 THEN
    p_end := INSTR( i_str, i_delim, p_start );
    WHILE p_end > 0 LOOP
      p_result.EXTEND;
      p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, p_end - p_start );
      p_start := p_end + c_ld;
      p_end := INSTR( i_str, i_delim, p_start );
    END LOOP;
    IF p_start <= c_len + 1 THEN
      p_result.EXTEND;
      p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, c_len - p_start + 1 );
    END IF;
  END IF;
  RETURN p_result;
END;
/

Create some test data:

CREATE TABLE test_data ( value ) AS
SELECT 'I always let my cat and dog at the animal nursery when I go to work by car' FROM DUAL UNION ALL
SELECT 'dog dog foo bar dog' FROM DUAL
/

Query 1:

SELECT *
FROM   test_data
WHERE  CARDINALITY(
         split_string( value, ' ' )    -- Split the string into a collection
         MULTISET INTERSECT            -- Intersect it with the input words
         StringList( 'dog', 'car', 'house', 'work', 'cat' )
       ) >= 3                          -- Check that the size of the intersection
                                       -- is at least 3 items.

Results:

|                                                                      VALUE |
|----------------------------------------------------------------------------|
| I always let my cat and dog at the animal nursery when I go to work by car |
MT0
  • 143,790
  • 11
  • 59
  • 117
  • This is a very good solution and will beat any regular expression at functionality and readability. – Callum Watkins Jun 02 '18 at 23:46
  • This solution, like all the others, ignores the subtle question of "what constitutes an exact word match." For example, it will only find one word match in the phrase "My brother has a cat, a dog and a car." - because the relevant tokens will be "cat," (with the comma) and "car." (with the period). It is very likely that the requirement would be to "match" three words in this example. No amount of clever code writing can make up for lack of clarity in the requirement. –  Jun 03 '18 at 12:44
2

Ignoring the questions I asked in a Comment under the original post, here is one easy way to solve the problem, with a join and aggregation (using a HAVING condition). Note that a word like doghouse in the input will match both dog and house, etc. (Do read my comment under the original post!)

In the query below, both the input phrase and the words to match are hardcoded in factored subqueries (the WITH clause). In a serious environment, both should be in base tables, or be provided as input variables, etc.

I show how to use the standard string comparison operator LIKE. This can be changed to REGEXP_LIKE, but that is generally unneeded (and indeed a bad idea). But if you need to differentiate between 'dog' and 'dogs' (and 'dogwood'), or need case insensitive comparison, etc., you can use REGEXP_LIKE. The point of this solution is that you don't need to worry about matching THREE different words; if you know how to match ONE (whether full word match is needed, capitalization does or does not matter, etc.), then you can also, easily, match THREE words under the same rules.

with
  inputs ( input_phrase ) as (
    select
  'I always let my cat and dog at the animal nursery when I go to work by car'
    from   dual
  ),
  words ( word_to_match) as (
    select 'dog'   from dual union all
    select 'car'   from dual union all
    select 'house' from dual union all
    select 'work'  from dual union all
    select 'cat'   from dual
  )
select   input_phrase
from     inputs inner join words 
                on input_phrase like '%' || word_to_match || '%'
group by input_phrase
having   count(*) >= 3
;

INPUT_PHRASE                                                              
--------------------------------------------------------------------------
I always let my cat and dog at the animal nursery when I go to work by car
  • Nice and simple.. should perform better than others. – Kaushik Nayak Jun 03 '18 at 04:34
  • `I always keep my carrot houseplant at the cattery` would also be matched. – MT0 Jun 03 '18 at 07:28
  • @MT0 - I said the same in my Comment to the OP - I used 'dogwood' as an example similar to your 'carrot'. "Match whole words only" is an incomplete clarification though; how about plural forms of the desired words, is that OK? How about capitalization? etc. As I stated explicitly in my Answer: IF you know how to address all these additional requirements (with REGEXP, or in any other manner), then you can use the technique I showed to extend from matching ONE word to matching as many as needed. Since that was my only goal, I didn't worry about the rest. –  Jun 03 '18 at 12:23
0

The following solution will exclude repeated matches, doesn't use regular expressions (though you can if you like), and doesn't use PL/SQL.

WITH match_list ( match_word ) AS (
    SELECT 'dog' AS match_word FROM dual
     UNION ALL
    SELECT 'work' FROM dual
     UNION ALL
    SELECT 'car' FROM dual
     UNION ALL
    SELECT 'house' FROM dual
     UNION ALL
    SELECT 'cat' FROM dual
)
SELECT phrase, COUNT(*) AS unique_match_cnt, SUM(match_cnt) AS total_match_cnt
     , LISTAGG(match_word, ',') WITHIN GROUP ( ORDER BY match_word ) AS unique_matches
  FROM (
    SELECT pt.phrase, ml.match_word, COUNT(*) AS match_cnt
      FROM phrase_table pt INNER JOIN match_list ml
        ON ' ' || LOWER(pt.phrase) || ' ' LIKE '%' || ml.match_word || '%'
     GROUP BY pt.phrase, ml.match_word
) GROUP BY phrase
HAVING COUNT(*) >= 3;

The key is putting the words you want to match into a table or common table expression/subquery. If you like you can use REGEXP_LIKE() in place of LIKE though I think that would be more expensive. Skip LISTAGG() if you're not using Oracle 11g or higher, or if you don't actually need to know which words were matched, and skip LOWER() if you want a case-sensitive match.

David Faber
  • 12,277
  • 2
  • 29
  • 40
-1

If you don't need to match different words.

(?:\b(?:dog|car|house|work|cat)\b.*?){3}

I don't know if this works in your environment.

EDIT: I didn't see there is another answer almost like this one.

Schorsch
  • 1
  • 1