
I've got MySQL V5.6.23 running on Amazon RDS. In it is an InnoDB table called product_details that contains about 10 columns that are all indexed for exact matches (dates, numbers, text, etc.). I then have a single product_name field that I've put a FULLTEXT index on. I also have lots of other fields we don't search on.

The table currently has 150M rows and we add about 3-5M every night, and also update another 10-20M every night. After running these inserts/updates at night, the FULLTEXT index seems to be dropped from memory (not sure that's exactly what's happening).

When I first run a query for, say, 'blue ford taurus', it can take up to a few minutes. The second time I run it, it takes a few seconds, if not a few hundred milliseconds. If I run OPTIMIZE TABLE product_details; after the nightly processing completes, then almost every search I test is as fast as it can be. However, OPTIMIZE TABLE takes hours to run, as I think it's rewriting the entire table (and the indexes?).

I've thought about creating a "warming" script that will just hit the table with common queries from users, but I don't have a good mental model of what is going on, so I don't know what this will warm up. Searching for 'blue ford taurus' seems to speed up more than just that query, but I don't understand why.

QUESTIONS

  1. How should one properly warm these indexes after loading new data every night? This table supports a web app with end users searching it every morning.

  2. How do I know what memory requirements I need to hold my indexes?

Comments

  1. I'm planning to move this all to Elasticsearch (or similar), where I have a lot of experience doing search. I'm unfamiliar with MySQL as a FULLTEXT "search engine," but am stuck with it at the moment.

Common Query

SELECT * FROM product_details as pd 
WHERE
    MATCH (pd.product_name) AGAINST ('+ipod +nano' IN BOOLEAN MODE)
    and pd.city_id IN (577,528,567,614,615,616,618) 
ORDER BY pd.timestamp DESC
LIMIT 1000;

The Table

CREATE TABLE `product_details` (
  `product_name` text NOT NULL,
  `category_name` varchar(100) NOT NULL,
  `product_description` text NOT NULL,
  `price` int(11) NOT NULL,
  `address` varchar(200) NOT NULL,
  `zip_code` varchar(30) NOT NULL DEFAULT '',
  `phone` bigint(10) DEFAULT NULL,
  `email` varchar(50) NOT NULL,
  `state` varchar(20) NOT NULL,
  `city` varchar(30) NOT NULL,
  `post_id` bigint(11) NOT NULL,
  `post_date` date DEFAULT NULL,
  `post_time` time NOT NULL,
  `updated_date` varchar(10) NOT NULL,
  `updated_time` time NOT NULL,
  `status` tinyint(4) NOT NULL,
  `timestamp` date NOT NULL,
  `new_field` tinyint(4) DEFAULT NULL,
  `multiple_items` tinyint(1) NOT NULL,
  `city_id` int(4) NOT NULL,
  `date_changed` date DEFAULT NULL,
  `latlong` varchar(100) NOT NULL,
  PRIMARY KEY (`post_id`),
  KEY `city_id` (`city_id`),
  KEY `post_date` (`post_date`),
  KEY `price` (`price`),
  KEY `category_name` (`category_name`),
  KEY `state` (`state`),
  KEY `multiple_items` (`multiple_items`),
  KEY `new_field` (`new_field`),
  KEY `phone` (`phone`),
  KEY `timestamp` (`timestamp`),
  KEY `date_changed` (`date_changed`),
  FULLTEXT KEY `product_name` (`product_name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Table Status (screenshot of SHOW TABLE STATUS output omitted)

The table status data above is actually for my dev table, which only has 18M rows in it. When I load all production data, it will have ~8x the amount of data, meaning the Data_length will be ~70GB and the Index_length will be ~32GB.

T. Brian Jones

1 Answer


Optimize (or not). Yes, OPTIMIZE TABLE copies the table over and rebuilds all the indexes, so it takes a long time. Do not run OPTIMIZE; it almost never helps. (Or do you see a significant change?)

Tuning. How much RAM do you have? How big are the indexes? SHOW TABLE STATUS.

innodb_buffer_pool_size should be about 70% of available RAM.
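
On RDS, innodb_buffer_pool_size is changed through the DB parameter group (it is not dynamically resizable in 5.6). One way to compare the current setting against this table's footprint, using only standard commands:

-- Current buffer pool size, in GB
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;

-- Data and index footprint of the table, in GB
SELECT  data_length  / 1024 / 1024 / 1024 AS data_gb,
        index_length / 1024 / 1024 / 1024 AS index_gb
    FROM  information_schema.TABLES
    WHERE  table_schema = DATABASE()      -- or name the schema explicitly
      AND  table_name = 'product_details';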

Shrinking the schema would help a little:

  • It is rarely a good idea to split a date-time value into separate DATE and TIME columns.
  • Why have both city and city_id in this table? Perhaps you should normalize city, state, and maybe zip_code into another table (one lookup table, not two or three); a sketch follows this list.
  • ids should be sized appropriately -- city_id could be SMALLINT UNSIGNED (2 bytes: 0..65535) instead of INT SIGNED (4 bytes).
  • Normalize category_name and any other repetitious columns?
  • updated_date is a VARCHAR??
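
A minimal sketch of the kind of normalization suggested above; the locations table and its columns are hypothetical, just to illustrate the idea:

-- Hypothetical lookup table: one row per distinct (city, state, zip_code)
CREATE TABLE locations (
  location_id  MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- 3 bytes, 0..16M
  city         varchar(30) NOT NULL,
  state        varchar(20) NOT NULL,
  zip_code     varchar(30) NOT NULL DEFAULT '',
  PRIMARY KEY (location_id),
  UNIQUE KEY (city, state, zip_code)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- product_details would then carry a single small column,
--   location_id MEDIUMINT UNSIGNED NOT NULL,
-- in place of city, state, and zip_code.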

The steps in your query

  1. Locate all the ids for products with both ipod and nano. Let's say there were 5555 such rows.
  2. Go to all 5555 rows and collect the desired info, which is all columns because of SELECT *. It sounds like the table is a lot larger than RAM, so this means something like 5555 disk reads -- likely to be the slowest part.
  3. Filter out undesired rows based on city_id. Let's say we are down to 3210 rows.
  4. Write all columns of all 3210 rows to a tmp table. Since there is a TEXT column, it will be a MyISAM table, not a faster MEMORY table.
  5. Sort on timestamp
  6. Deliver the first 1000.

As I hope you can see, bulky rows mean bulky stuff in the tmp table. Reduce the * and/or shrink the columns.
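
For example, listing only the columns the app actually displays (the column list below is just an illustration):

SELECT  pd.post_id, pd.product_name, pd.price, pd.city, pd.timestamp
    FROM  product_details AS pd
    WHERE  MATCH (pd.product_name) AGAINST ('+ipod +nano' IN BOOLEAN MODE)
      AND  pd.city_id IN (577,528,567,614,615,616,618)
    ORDER BY  pd.timestamp DESC
    LIMIT  1000;

Note that product_name is a TEXT column, so keeping it in the list still forces the tmp table to disk; the trick below sidesteps most of that by sorting narrow rows first and fetching the wide columns only at the end.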

Here's a trick to cut back on the tmp table size (steps 4,5,6):

SELECT  ...
    FROM  product_details as pd
    JOIN  
      ( SELECT  post_id
            FROM  product_details
            WHERE  MATCH (product_name) AGAINST ('+ipod +nano' IN BOOLEAN MODE)
              and  city_id IN (577,528,567,614,615,616,618)
            ORDER BY timestamp DESC
            LIMIT  1000
      ) x USING (post_id)
    ORDER BY  pd.timestamp DESC;

However, the tmp table is not the worst part, and this requires a second sort. So, you could try this, but don't hold your breath.

Please note that when you run a test that is likely to be I/O-bound, run it twice. The second run will be a fairer comparison because it, presumably, will have no I/O.
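
One caveat when timing repeat runs on 5.6: if the query cache is enabled, the second run may be fast simply because the identical result comes back from the query cache, not because the buffer pool is warm. Adding SQL_NO_CACHE while benchmarking keeps the comparison about I/O:

SELECT SQL_NO_CACHE * FROM product_details AS pd
WHERE
    MATCH (pd.product_name) AGAINST ('+ipod +nano' IN BOOLEAN MODE)
    AND pd.city_id IN (577,528,567,614,615,616,618)
ORDER BY pd.timestamp DESC
LIMIT 1000;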

Another layer should be faster:

SELECT  pd...
    FROM
      ( SELECT  b.post_id, b.timestamp
            FROM
              ( SELECT  post_id
                    FROM  product_details
                    WHERE  MATCH (product_name) AGAINST ('+ipod +nano' IN BOOLEAN MODE)
              ) AS a
            JOIN  product_details AS b  ON b.post_id = a.post_id
            WHERE  b.city_id IN (577,528,567,614,615,616,618)
            ORDER BY  b.timestamp DESC
            LIMIT  1000
      ) AS x
    JOIN  product_details AS pd  ON pd.post_id = x.post_id
    ORDER BY  pd.timestamp DESC;

INDEX(post_id, city_id, timestamp) -- also required for this formulation
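
That index would be added with something like the following (the index name is arbitrary); on a 150M-row table the ALTER itself will take a while, although adding a secondary index in 5.6 is an in-place operation:

ALTER TABLE product_details
    ADD INDEX post_city_ts (post_id, city_id, timestamp);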

The hope with this formulation is

  • The secondary filtering (city_id) is done on a much smaller BTree (that index), hence more likely to live in RAM, thereby avoiding I/O.
  • Only 1000 probes are needed into the big table. This should be a big win.

The steps:

  1. Get 5555 ids from the FULLTEXT index.
  2. Filter down to 3210 ids using what will hopefully be an in-memory operation.
  3. Sort 3210 'narrow' rows (just 3 columns, not all). This time it can be a MEMORY tmp table (a quick check for this appears after the step list).
  4. JOIN back to the original table only 1000 times. (The big win.) (I could be wrong here; it could be 3210, but that is still better than 5555.)
  5. Deliver the results.
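
A quick way to check whether the sort in step 3 really stayed in memory is to compare the tmp-table counters around the query:

SHOW SESSION STATUS LIKE 'Created_tmp%';
-- ... run the reformulated query here ...
SHOW SESSION STATUS LIKE 'Created_tmp%';
-- If Created_tmp_disk_tables did not increase, the sort used an in-memory tmp table.
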
Rick James
  • I added `table status` data to my question. How does this affect my RAM requirements for speedy queries? Do you have thoughts on warming the `FULLTEXT` index at night before users are in the system? A lot of the cruft in this system is legacy, and I'm trying to get a broken site working again for a few months while we rebuild the entire system. Great suggestions. Thanks!! – T. Brian Jones Mar 12 '16 at 02:37
  • If you don't restart the system and don't do big queries overnight, then you should not need to "prime the cache". Yes, it might help to run some representative queries. But that would help only queries like that. My suggestions attempt to help all queries. – Rick James Mar 12 '16 at 04:00
  • If you have 16GB of RAM, you probably have a lot of I/O going on. (And your comments imply that you have a small RAM.) If you have over 100GB of RAM, then I am puzzled at the slowdown. Do you have other huge tables that are 'busy'? – Rick James Mar 12 '16 at 04:07
  • I currently have 16GB of memory. What specifically do I need to have enough memory for, for queries to be fast? If table_status index_length is 32GB, do I need 32GB of memory (probably more like 40GB)? What loads the indexes into memory? There are steps that are missing that I'm trying to wrap my head around. If I restart the server, memory is cleared and queries are slow. What puts these indexes back into memory? Can I force the system to load them all? – T. Brian Jones Mar 15 '16 at 01:34
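
For reference, a minimal cache-warming sketch along the lines discussed in these comments (the statements are illustrative, not from the thread): run a handful of the most common user searches right after the nightly load, so that the FULLTEXT and secondary index pages they touch are back in the buffer pool before users arrive.

-- Illustrative warm-up queries to run after the nightly load.
-- COUNT(*) with MATCH exercises the FULLTEXT index; running the real
-- query shape as well warms the secondary indexes and row data it needs.
SELECT COUNT(*) FROM product_details
    WHERE MATCH (product_name) AGAINST ('+blue +ford +taurus' IN BOOLEAN MODE);
SELECT COUNT(*) FROM product_details
    WHERE MATCH (product_name) AGAINST ('+ipod +nano' IN BOOLEAN MODE);
-- ...one statement per popular search term...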
  • Did you try the reformulated queries? As for RAM -- it depends on how many _different_ terms you look up, and how much other data, etc, you need to access. The 5555 hits to the index is probably less I/O than the 3210 hits to the data. You would be safer with enough RAM for Data_length + Index_length. – Rick James Mar 15 '16 at 03:37