MySQL Fulltext in Binary Mode doesn't match phrases

Question

Implementing a simple fulltext search I encountered a problem with the combination of boolean mode and phrases. Also worth noting is that the column has a binary collation (utf8_bin) whilst the table does not have this.

Given the following setup:

CREATE TABLE `test` (
  `test_id` int(11) NOT NULL AUTO_INCREMENT,
  `text_bin` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`test_id`),
  FULLTEXT KEY `text_bin` (`text_bin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `test` (`test_id`, `text_bin`) VALUES
(1, 'Lorem Ipsum Dolor Sit Amet.'),
(2, 'Consectetuer Adipiscing Elit.'),
(3, 'Amet Sit Dolor Ipsum Lorem.')
;

Then running this query:

SELECT t.test_id, t.text_bin,
  MATCH(t.text_bin) AGAINST ('Lorem Ipsum' IN BOOLEAN MODE) as m_Words,
  MATCH(t.text_bin) AGAINST ('"Lorem Ipsum"' IN BOOLEAN MODE) as m_Phrase,
  MATCH(t.text_bin) AGAINST ('Lorem' IN BOOLEAN MODE) as m_Lorem,
  MATCH(t.text_bin) AGAINST ('Ipsum' IN BOOLEAN MODE) as m_Ipsum
FROM test t
;

This yields the following results:

test_id	text_bin	m_Words	m_Lorem	m_Ipsum
1	Lorem Ipsum Dolor Sit Amet.	0.0620	0.0310	0.0310
2	Consectetuer Adipiscing Elit.	0	0	0
3	Amet Sit Dolor Ipsum Lorem.	0.0620	0.0310	0.0310

(Note: I shortened the numbers to 4 decimal places for better readability.)

For the column m_Phrase I would expect a value greater then 0 on the first row. Is this a bug or can someone explain why the result is 0?

DB Fiddle: https://www.db-fiddle.com/f/8qxR3SiPVtESU3saebhgBG/0

Yeah, I would definitely expect >0 value there. In fact, that should be the highest value as the phrase is found in 1 out of 3 row, not in 2 out of 3. — Shadow, Nov 14 '22 at 14:48

dangarfield · Answer 1 · 2022-11-17T15:27:19.883

My gut instinct is that this is to do with space delimiter. The same happens with utf8_unicode_ci

Eg in this fiddle different collations give different results with both case sensitivity and matching of spaces as part of literals - https://www.db-fiddle.com/f/pi78uuA1RCFeToaRH9skcK/2

Update - Fiddle above updated with special chars (.#*), which match phrases as expected.

I don't know the solution to your problem other than use a different collation - just use utf8_general_ci

It would be interesting to know your use case or even the performance considerations that you face that (assumably) prompt you to use utf8_bin.

Update 2:

This looks like a reported bug. Although, from 2018... One fix appears to be to downgrade to version 5.7.19.

It doesn't appear as though there is a working latest version approach for both a FULLTEXT based query with literal phrases and case sensitivity. Very very strange. The simplest (not perfect) approach would be the prepend each word with a plus, +Lorem +Ipsum, but would obviously lead to some false positives. (See above fiddle with m_all and m_AllLower fields.

Thanks for your reply. I'm using `utf8_bin` so that the search is **case sensitive**. — jor, Nov 17 '22 at 14:24

score 0 · Answer 2 · answered Nov 18 '22 at 23:48

0

Still failing in 8.0.31.

This might be a workaround:

WHERE MATCH(t.text_bin) AGAINST ('+Lorem +Ipsum' IN BOOLEAN MODE)
  AND t.text_bin LIKE '%Lorem Ipsum%'

That is, avoid the phrase test in FT and use LIKE as a secondary filter to check for the phrase.

Alas, it won't let you display the relevance.

answered Nov 18 '22 at 23:48

Rick James

135,179
13
127
222

1

Yes, falling back to a `LIKE` condition is a workaround. For more complex phrases (i.e. wildcard) you can even use `REGEXP`. Only downside is missing relevance score and bad performance on a large text corpus. – jor Nov 23 '22 at 07:18

MySQL Fulltext in Binary Mode doesn't match phrases

2 Answers2