2

Implementing a simple fulltext search I encountered a problem with the combination of boolean mode and phrases. Also worth noting is that the column has a binary collation (utf8_bin) whilst the table does not have this.

Given the following setup:

CREATE TABLE `test` (
  `test_id` int(11) NOT NULL AUTO_INCREMENT,
  `text_bin` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`test_id`),
  FULLTEXT KEY `text_bin` (`text_bin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `test` (`test_id`, `text_bin`) VALUES
(1, 'Lorem Ipsum Dolor Sit Amet.'),
(2, 'Consectetuer Adipiscing Elit.'),
(3, 'Amet Sit Dolor Ipsum Lorem.')
;

Then running this query:

SELECT t.test_id, t.text_bin,
  MATCH(t.text_bin) AGAINST ('Lorem Ipsum' IN BOOLEAN MODE) as m_Words,
  MATCH(t.text_bin) AGAINST ('"Lorem Ipsum"' IN BOOLEAN MODE) as m_Phrase,
  MATCH(t.text_bin) AGAINST ('Lorem' IN BOOLEAN MODE) as m_Lorem,
  MATCH(t.text_bin) AGAINST ('Ipsum' IN BOOLEAN MODE) as m_Ipsum
FROM test t
;

This yields the following results:

test_id text_bin m_Words m_Phrase m_Lorem m_Ipsum
1 Lorem Ipsum Dolor Sit Amet. 0.0620 0 0.0310 0.0310
2 Consectetuer Adipiscing Elit. 0 0 0 0
3 Amet Sit Dolor Ipsum Lorem. 0.0620 0 0.0310 0.0310

(Note: I shortened the numbers to 4 decimal places for better readability.)

For the column m_Phrase I would expect a value greater then 0 on the first row. Is this a bug or can someone explain why the result is 0?

DB Fiddle: https://www.db-fiddle.com/f/8qxR3SiPVtESU3saebhgBG/0

Shadow
  • 33,525
  • 10
  • 51
  • 64
jor
  • 2,058
  • 2
  • 26
  • 46
  • Yeah, I would definitely expect >0 value there. In fact, that should be the highest value as the phrase is found in 1 out of 3 row, not in 2 out of 3. – Shadow Nov 14 '22 at 14:48

2 Answers2

0

My gut instinct is that this is to do with space delimiter. The same happens with utf8_unicode_ci

Eg in this fiddle different collations give different results with both case sensitivity and matching of spaces as part of literals - https://www.db-fiddle.com/f/pi78uuA1RCFeToaRH9skcK/2

Update - Fiddle above updated with special chars (.#*), which match phrases as expected.

I don't know the solution to your problem other than use a different collation - just use utf8_general_ci

It would be interesting to know your use case or even the performance considerations that you face that (assumably) prompt you to use utf8_bin.


Update 2:

This looks like a reported bug. Although, from 2018... One fix appears to be to downgrade to version 5.7.19.

It doesn't appear as though there is a working latest version approach for both a FULLTEXT based query with literal phrases and case sensitivity. Very very strange. The simplest (not perfect) approach would be the prepend each word with a plus, +Lorem +Ipsum, but would obviously lead to some false positives. (See above fiddle with m_all and m_AllLower fields.

dangarfield
  • 2,210
  • 1
  • 13
  • 18
0

Still failing in 8.0.31.

This might be a workaround:

WHERE MATCH(t.text_bin) AGAINST ('+Lorem +Ipsum' IN BOOLEAN MODE)
  AND t.text_bin LIKE '%Lorem Ipsum%'

That is, avoid the phrase test in FT and use LIKE as a secondary filter to check for the phrase.

Alas, it won't let you display the relevance.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 1
    Yes, falling back to a `LIKE` condition is a workaround. For more complex phrases (i.e. wildcard) you can even use `REGEXP`. Only downside is missing relevance score and bad performance on a large text corpus. – jor Nov 23 '22 at 07:18