I've got MySQL 5.6.23 running on Amazon RDS. It contains an InnoDB table called product_details
with about 10 columns that are individually indexed for exact-match lookups (dates, numbers, text, etc.), plus a single product_name
field that has a FULLTEXT index on it. There are also lots of other fields we don't search on.
The table currently has 150M rows; we insert about 3-5M new rows every night and update another 10-20M. After these nightly inserts/updates finish, the FULLTEXT index seems to get dropped from memory (though I'm not sure that's exactly what's happening).
The first time I run a query for, say, 'blue ford taurus', it can take up to a few minutes. The second time I run it, it takes a few seconds if not a few hundred milliseconds. If I run OPTIMIZE TABLE product_details;
after the nightly processing completes, then almost every search I test is as fast as it can be. But OPTIMIZE TABLE
takes hours to run, presumably because it rewrites the entire table (and its indexes?).
I've thought about creating a "warming" script that just hits the table with common user queries, but I don't have a good mental model of what's going on, so I don't know what that would actually warm up. Searching for 'blue ford taurus' seems to speed up more than just that one query, and I don't understand why.
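For reference, the kind of warming script I have in mind is just a batch of representative FULLTEXT queries run right after the load (the terms here are made-up examples, not real user data):

```sql
-- Hypothetical warming pass: run a handful of representative
-- boolean-mode searches after the nightly load so the first real
-- users don't pay the cost of pulling index pages from disk.
SELECT COUNT(*) FROM product_details
WHERE MATCH (product_name) AGAINST ('+ford' IN BOOLEAN MODE);

SELECT COUNT(*) FROM product_details
WHERE MATCH (product_name) AGAINST ('+ipod' IN BOOLEAN MODE);
```

COUNT(*) is used instead of SELECT * so the script touches the index without dragging every row's columns through the network.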
QUESTIONS
How should one properly warm these indexes after loading new data every night? This table supports a web app with end users searching it every morning.
How do I know what memory requirements I need to hold my indexes?
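For context, the only sizing numbers I know how to get are the on-disk figures from information_schema (this is how I produced the data_length/index_length estimates further down; note I'm not sure how much of the InnoDB FULLTEXT index is actually reflected in index_length):

```sql
-- Rough sizing: data_length + index_length is the on-disk footprint,
-- which should be an upper bound on the buffer pool needed to hold
-- the table and its B-tree indexes fully in memory.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'product_details';
```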
Comments
- I'm planning to move this all to Elasticsearch (or similar), where I have a lot of experience doing search. I'm unfamiliar with MySQL as a FULLTEXT "search engine," but am stuck with it at the moment.
Common Query
SELECT * FROM product_details AS pd
WHERE
MATCH (pd.product_name) AGAINST ('+ipod +nano' IN BOOLEAN MODE)
AND pd.city_id IN (577,528,567,614,615,616,618)
ORDER BY pd.timestamp DESC
LIMIT 1000;
The Table
CREATE TABLE `product_details` (
`product_name` text NOT NULL,
`category_name` varchar(100) NOT NULL,
`product_description` text NOT NULL,
`price` int(11) NOT NULL,
`address` varchar(200) NOT NULL,
`zip_code` varchar(30) NOT NULL DEFAULT '',
`phone` bigint(10) DEFAULT NULL,
`email` varchar(50) NOT NULL,
`state` varchar(20) NOT NULL,
`city` varchar(30) NOT NULL,
`post_id` bigint(11) NOT NULL,
`post_date` date DEFAULT NULL,
`post_time` time NOT NULL,
`updated_date` varchar(10) NOT NULL,
`updated_time` time NOT NULL,
`status` tinyint(4) NOT NULL,
`timestamp` date NOT NULL,
`new_field` tinyint(4) DEFAULT NULL,
`multiple_items` tinyint(1) NOT NULL,
`city_id` int(4) NOT NULL,
`date_changed` date DEFAULT NULL,
`latlong` varchar(100) NOT NULL,
PRIMARY KEY (`post_id`),
KEY `city_id` (`city_id`),
KEY `post_date` (`post_date`),
KEY `price` (`price`),
KEY `category_name` (`category_name`),
KEY `state` (`state`),
KEY `multiple_items` (`multiple_items`),
KEY `new_field` (`new_field`),
KEY `phone` (`phone`),
KEY `timestamp` (`timestamp`),
KEY `date_changed` (`date_changed`),
FULLTEXT KEY `product_name` (`product_name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The table stats above are actually for my dev table, which only has 18M rows. When I load all the production data it will hold ~8x as much, meaning data_length
will be ~70GB and index_length
will be ~32GB.
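To spell out the sizing implication of those estimates (a rough upper bound, assuming we'd want the whole table and its indexes resident in the buffer pool):

```python
# Rough working-set estimate from the production projections above.
# Assumes (optimistically) that keeping everything hot requires a
# buffer pool roughly the size of data_length + index_length.
data_gb = 70    # projected production data_length
index_gb = 32   # projected production index_length

working_set_gb = data_gb + index_gb
print(working_set_gb)  # -> 102
```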