1

I need to search a column(varchar) for specific whole words. I'm using the query below but not getting the desired results;

    select *
    from table1
    WHERE upper(c.name) RLIKE ('FECHADO|CIERRE|CLOSED|REVISTO. NORMAL.')

My problem is to guarantee that, for example with the word 'CLOSED', that only matches; 'Case Closed', but not 'Case Disclosed'. The query above can't match whole words only. Can anyone help me to find the best way to achieve those results, both in HIVE an IMPALA.

My best regards

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Does this answer your question? [HIVE/PIG JOIN Based on SUBSTRING match](https://stackoverflow.com/questions/70010173/hive-pig-join-based-on-substring-match) – Matt Andruff Dec 17 '21 at 15:04
  • Thank's Matt, but unfortunately no. The instr function does not distinguish 'Closed' from 'Disclosed'. instr(column, 'Closed') finds both. Thank you anyway. –  Dec 17 '21 at 15:18
  • What REVISTO. NORMAL. should match? word REVISTO, dot, space, NORMAL dot ??? or something else – leftjoin Dec 17 '21 at 18:17

2 Answers2

0

I explained a similar solution here: HIVE/PIG JOIN Based on SUBSTRING match This will make a lot of records and take a long time to run but I think it will run better than running a like as it uses a join to complete the matching instead of searching all records.

If this is a hive only solution I would explode on " " with lateral view and then match on closed.

  1. split - makes an array based on splitting on " "

  2. LATERAL VIEW EXPLODE - changes an array into rows.

    WITH (
        select *
        from table1
        LATERAL VIEW EXPLODE(SPLIT( name, " ")) table_name as names
    ) as exploded_names
    
    select * from exploded_names
        where 
           names IN ('FECHADO','CIERRE','CLOSED','REVISTO.')
    
Matt Andruff
  • 4,974
  • 1
  • 5
  • 21
  • Hi Matt. Let me test. But I think that could be more appropriate to use a regexp. thank you in advance, i'l test it –  Dec 17 '21 at 15:48
  • If the data set is small, reexp could work just fine. The issue I see is that you are comparing every record in the entire data set. This is expensive. Doing this join allows you to spread the load to a join and that makes it more efficient. – Matt Andruff Dec 17 '21 at 16:09
0

You can add word boundary \\b to match only exact words:

rlike '(?i)\\bFECHADO\\b|\\bCIERRE\\b|\\bCLOSED\\b'

(?i) means case insensitive, no need to use UPPER.

And the last alternative in your regex pattern is REVISTO. NORMAL.

If dots in it should be literally dots, use \\.

Like this: REVISTO\\. NORMAL\\.

Dot in regexp means any character and should be shielded with two backslashes to match dot literally.

Above regex works in Hive. Unfortunately I have no Impala to test it

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thank you very much. I will test on impala to check if it runs OK. –  Dec 17 '21 at 23:05
  • In your opinion it is better to use (rlike) or (regexp)? –  Dec 17 '21 at 23:35
  • @LEOPOLDO these two relational operators are synonyms in Hive, see https://cwiki.apache.org/confluence/display/hive/languagemanual+udf#LanguageManualUDF-RelationalOperators – leftjoin Dec 18 '21 at 06:53
  • @LEOPOLDO In Impala regexp and rlike are also synonyms – leftjoin Dec 18 '21 at 10:29
  • Many thanks leftjoin. This worked. The only comment I need to add is regarding the '\\.' for literally dots in 'REVISTO\\. NORMAL\\.'. I've tested and works likewise without them; like this: where c.name rlike ('(?i)REVISTO. NORMAL.') –  Dec 20 '21 at 11:10
  • @LEOPOLDO Dot means any character. Check, it will match not only dots, it will match REVISTOX, REVISTOY, REVISTO. etc etc – leftjoin Dec 20 '21 at 11:12
  • 1
    Ok. You're 100% right. Thank you and my best regards. –  Dec 20 '21 at 11:15