2

We run a dictionary and have run into a problem with searches that contain an apostrophe at the start of a search string. In English words like 'twas are quite rare but in the language we're dealing with, ' is considered a word character and extremely common at the start of a phrase (for instance 's) and also at the end of words (for instance a').

Oddly enough, RegEx searches don't seem to struggle with this if it's in the middle (for example air a' bhòrd gets all the desired results) but ' at beginning or end of a search string is not treated as part of a word by RegEx.

We've ascertained this is part of the RegEx specification (only alphanumeric characters and _ are treated as part of a word) but we're wondering if it is it possible to write a RegEx expression that also treats apostrophes as part of a word?

This is what we're currently getting:

-- Demonstration on MySQL 5.6.21 Community
Select ('cat''s' REGEXP CONCAT('[[:<:]]', 'cat''s', '[[:>:]]'));
-- returns 1
Select ('''cat''s' REGEXP CONCAT('[[:<:]]' ,'''cat''s' ,'[[:>:]]' ));
-- returns 0
Select ('_cat''s' REGEXP CONCAT('[[:<:]]' ,'_cat''s' ,'[[:>:]]' ));
-- returns 1
Select ('-cat''s' REGEXP CONCAT('[[:<:]]' ,'-cat''s' ,'[[:>:]]' ));
-- returns 0
Select (' cat''s' REGEXP CONCAT('[[:<:]]' ,' cat''s' ,'[[:>:]]' ));
-- returns 0
Select ('cat''' REGEXP CONCAT('[[:<:]]' ,'cat''' ,'[[:>:]]' ));
-- returns 0

Any suggestions greatly welcomed :)

akerbeltz
  • 21
  • 3
  • Are you testing the entire column for being a "word"? Or are you checking for a "word" inside a longer string? – Rick James Jan 20 '19 at 02:41
  • Is a quoted `'cats'` a "word"? In your test case, it would be `'''cats'''`. Ditto for `"cats"`. – Rick James Jan 20 '19 at 02:44
  • If GMB's Answer turns out to be inadequate, please provide a longer list of "words" that should pass and a list of "words" that don't pass. – Rick James Jan 20 '19 at 02:46

1 Answers1

1

I think that you should provide your own definition of what a word character is, instead of relying on default ICE word boundaries ([[:<:]], [[:>:]]). From the mysql 5.6 documentation :

A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore (_).

That would mean : '^|[^[:alnum:]_]'

^             -- the beginning of the string
|             -- OR
[^            -- any character OTHER than
    [:alnum:]     -- an alphanumeric character
    _             -- an underscore
]

And ICE end of string would be : '[^[:alnum:]_]|$', where $ represents the end of string.

You could just modify this to add the single quote in the character class, like :

  • beginning : '^|[^[:alnum:]_'']'
  • end : '[^[:alnum:]_'']|$'

Here is your regex :

SELECT (val REGEXP CONCAT('(^|[^[:alnum:]_''])', 'cat''s', '([^[:alnum:]_'']|$)'));

See the demo on dbfiddle

Schema (MySQL v5.6)


Query #1

Select ('cat''s'
    REGEXP CONCAT('(^|[^[:alnum:]_''])', 'cat''s', '([^[:alnum:]_'']|$)')) res;
| res |
| --- |
| 1   |

Query #2

Select ('''cat''s' 
    REGEXP CONCAT('(^|[^[:alnum:]_''])', '''cat''s', '([^[:alnum:]_'']|$)' )) res;
| res |
| --- |
| 1   |

Query #3

Select ('_cat''s'
    REGEXP CONCAT('(^|[^[:alnum:]_''])', '_cat''s' , '([^[:alnum:]_'']|$)' )) res;
| res |
| --- |
| 1   |

Query #4

Select ('-cat''s'  
    REGEXP CONCAT('(^|[^[:alnum:]_''])', '-cat''s' , '([^[:alnum:]_'']|$)' )) res;
| res |
| --- |
| 1   |

Query #5

Select (' cat''s' 
    REGEXP CONCAT('(^|[^[:alnum:]_''])', ' cat''s' , '([^[:alnum:]_'']|$)' )) res;
| res |
| --- |
| 1   |

Query #6

Select ('cat'''
    REGEXP CONCAT('(^|[^[:alnum:]_''])', 'cat''' , '([^[:alnum:]_'']|$)' )) res;
| res |
| --- |
| 1   |

GMB
  • 216,147
  • 25
  • 84
  • 135