
I have a largish table (2 million rows) with a column containing text identifiers (these are Latin names of species: Homo_sapiens, Tyrannosaurus_rex, etc.)

I have another table containing Latin names and "common" names of species, and I can query it to get a small selection (~140) of Latin names, some of which map into the first table. I want the rows in the first table whose names match this small selection exactly. The query that produces the small selection (only 140 rows) runs fast, since the common name column has a MySQL FULLTEXT index:

select distinct latin_name from common_names_table 
  where match(common_name) against('+*mo*' in boolean mode)

But if I then use the SQL IN operator to match these against the large, 2-million-row table, it takes many minutes:

select latin_name,popularity from big_table 
 where latin_name in (
  select distinct latin_name from common_names_table 
    where match(common_name) against('+*mo*' in boolean mode)
  )
 ORDER BY popularity DESC LIMIT 50;

That is true even though I have both a full-text and a normal index on the latin_name column:

CREATE FULLTEXT INDEX name_fulltext_index ON big_table (latin_name);
CREATE INDEX name_index          ON big_table (latin_name);

How can I speed this up? Is there a problem using the IN operator with indexed text fields? If so, is there some special sort of "exact match" index I can use for text fields? The latin_name fields are both of type VARCHAR, with a max length of 190 in the small table and 200 in the large one, if that makes any difference.
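In case it helps diagnose, here is the kind of EXPLAIN I can run on the slow query (a sketch; the exact output columns vary by MySQL version, but a `type` of ALL with a NULL `key` for big_table would mean a full scan of the 2 million rows):

    -- Ask the optimizer how it plans to execute the slow query.
    EXPLAIN
    SELECT latin_name, popularity FROM big_table
     WHERE latin_name IN (
       SELECT DISTINCT latin_name FROM common_names_table
        WHERE MATCH(common_name) AGAINST('+*mo*' IN BOOLEAN MODE)
     )
     ORDER BY popularity DESC LIMIT 50;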

Thanks for any help


As requested, here are the table definitions:

CREATE TABLE `big_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `parent` int(11) NOT NULL,
  `latin_name` varchar(200) DEFAULT NULL,
  `popularity` double DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `pop_index` (`popularity`),
  KEY `name_index` (`latin_name`),
  FULLTEXT KEY `name_fulltext_index` (`latin_name`)
) ENGINE=InnoDB AUTO_INCREMENT=1781766 DEFAULT CHARSET=utf8;

CREATE TABLE `common_names_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `latin_name` varchar(190) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
  `common_name` varchar(190) CHARACTER SET utf8mb4 NOT NULL,
  PRIMARY KEY (`id`),
  KEY `name_index` (`latin_name`),
  FULLTEXT KEY `common_name_index` (`common_name`)
) ENGINE=InnoDB AUTO_INCREMENT=2024 DEFAULT CHARSET=utf8;
user2667066

3 Answers


You can try a join instead of an `IN`:

select b.latin_name, b.popularity
from (
    select distinct latin_name from common_names_table
    where match(common_name) against('+*mo*' in boolean mode)
) a
left join big_table as b on (a.latin_name = b.latin_name)
where b.latin_name is not null
order by b.popularity desc limit 50;

A left join (keeping only rows where the right side is not null) will probably be faster than an inner join.

verhie
  • Thanks. Oddly, that seems also to take many, many minutes to run, as does an inner join. But if I reverse the join, and do `big_table left join (select ...)` it is super fast, as long as I don't specify `a.latin_name is not null`. I can't work out why simply swapping the join would have such a huge effect – user2667066 Oct 18 '16 at 09:27
  • Hadn't seen the table definitions. I think the utf8mb4 in the second table is causing a charset issue. Maybe you can convert it explicitly in the subquery; that may speed it up. On the other hand, if it works the other way round, problem solved :). In the first query the big table gets converted; if you swap the tables as you have done, it converts the small table instead, and that's faster – verhie Oct 18 '16 at 10:44
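The explicit conversion suggested in the comment could be sketched like this (an untested sketch, assuming the only mismatch is the column charset; `CONVERT ... USING` is standard MySQL, but note that converting utf8mb4 down to utf8 would mangle any 4-byte characters, which Latin species names shouldn't contain):

    -- Force the subquery's result into big_table's charset (utf8)
    -- so the join comparison can use big_table's name_index.
    SELECT b.latin_name, b.popularity
    FROM (
        SELECT DISTINCT CONVERT(latin_name USING utf8) AS latin_name
        FROM common_names_table
        WHERE MATCH(common_name) AGAINST('+*mo*' IN BOOLEAN MODE)
    ) a
    JOIN big_table AS b ON (a.latin_name = b.latin_name)
    ORDER BY b.popularity DESC LIMIT 50;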

LEFT is not necessary:

select  b.latin_name, b.popularity
    from  
    (
        SELECT  distinct latin_name
            from  common_names_table
            where  match(common_name) against('+*mo*' in boolean mode) 
    ) cn
    join  big_table as b  ON (cn.latin_name = b.latin_name)
    ORDER BY  b.popularity DESC
    LIMIT  50;

To get a feel for why it is slow, perform

        SELECT  COUNT(distinct latin_name)
            from  common_names_table
            where  match(common_name) against('+*mo*' in boolean mode);

That many rows need to be found in big_table before the sort and limit.

Rick James

AHA: thanks to @krishKM asking for the table definitions, I've found the problem. The character set encodings of the two columns I'm trying to match are different: one is MySQL's default utf8, the other is the 'proper' 4-byte utf8mb4 encoding.

If I set latin_name to the same character encoding in both tables, the query takes ~20 milliseconds instead of 5 minutes.
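For the record, one way to do that conversion (a sketch, not tested against this exact schema; ALTER TABLE rebuilds the table and its indexes, so back it up first and expect it to take a while on 2 million rows):

    -- Bring big_table.latin_name up to utf8mb4 to match common_names_table.
    ALTER TABLE big_table
      MODIFY latin_name VARCHAR(200) CHARACTER SET utf8mb4 DEFAULT NULL;

Caveat: at 4 bytes per character, an index over VARCHAR(200) exceeds the 767-byte index prefix limit of older InnoDB row formats, so on older MySQL versions this may require shortening the column (the small table's 190 is close to the 191-character limit) or enabling innodb_large_prefix. Converting the small table's column down to utf8 instead is the other option.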

user2667066