0

I have 3 tables in which I'm trying to preform joins on, and inserting the resulting data into another table. The query is taking anywhere between 15-30 mins depending on the dataset. The tables I'm selecting from and joining on are at least 25k records each but will quickly grow to be 500k+.

I tried adding indexes on the fields but still isn't helping that much. Are there any other things I can try or are joins on this scale just going to take this long?

Here is the query I'm trying to perform:

INSERT INTO audience.topitem
(runs_id, total_training_count, item, standard_index_value, significance, seed_count, nonseed_count, prod, model_type, level_1, level_2, level_3, level_4, level_5)
SELECT 5, seed_count + nonseed_count AS total_training_count,
ii.item, standard_index_value, NULL, seed_count, nonseed_count,
standard_index_value * seed_count AS prod, 'site', topic_L1, topic_L2, topic_L3, topic_L4, topic_L5
FROM audience.item_indexes ii
LEFT JOIN audience.usercounts uc ON ii.item = uc.item AND ii.runs_id = uc.runs_id 
LEFT JOIN categorization.categorization at on ii.item = at.url
WHERE ii.runs_id = 5

Table: audience.item_indexes

CREATE TABLE `item_indexes` (
`item` varchar(1024) DEFAULT NULL,
`standard_index_value` float DEFAULT NULL,
`runs_id` int(11) DEFAULT NULL,
`model_type` enum('site','term','combo') DEFAULT NULL,
KEY `item_idx` (`item`(333))
) ENGINE=MyISAM DEFAULT CHARSET=utf8; 

Table: audience.usercounts

CREATE TABLE `usercounts` (
`item` varchar(1024) DEFAULT NULL,
`seed_count` int(11) DEFAULT NULL,
`nonseed_count` int(11) DEFAULT NULL,
`significance` float(19,6) DEFAULT NULL,
`runs_id` int(11) DEFAULT NULL,
`model_type` enum('site','term','combo') DEFAULT NULL,
KEY `item_idx` (`item`(333))
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Table: audience.topitem

CREATE TABLE `topitem` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`total_training_count` int(11) DEFAULT NULL,
`item` varchar(1024) DEFAULT NULL,
`standard_index_value` float(19,6) DEFAULT NULL,
`significance` float(19,6) DEFAULT NULL,
`seed_count` int(11) DEFAULT NULL,
`nonseed_count` int(11) DEFAULT NULL,
`prod` float(19,6) DEFAULT NULL,
`cat_type` varchar(32) DEFAULT NULL,
`cat_level` int(11) DEFAULT NULL,
`conf` decimal(19,9) DEFAULT NULL,
`level_1` varchar(64) DEFAULT NULL,
`level_2` varchar(64) DEFAULT NULL,
`level_3` varchar(64) DEFAULT NULL,
`level_4` varchar(64) DEFAULT NULL,
`level_5` varchar(64) DEFAULT NULL,
`runs_id` int(11) DEFAULT NULL,
`model_type` enum('site','term','combo') DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=825 DEFAULT CHARSET=utf8;

Table: categorization.categorization

CREATE TABLE `AT_categorization` (
`url` varchar(760) NOT NULL ,
`language` varchar(10) DEFAULT NULL,
`category` text,
`entity` text,
`source` varchar(255) DEFAULT NULL,
`topic_L1` varchar(45) NOT NULL DEFAULT '',
`topic_L2` varchar(45) NOT NULL DEFAULT '',
`topic_L3` varchar(45) NOT NULL DEFAULT '',
`topic_L4` varchar(45) NOT NULL DEFAULT '',
`topic_L5` varchar(45) NOT NULL DEFAULT '',
`last_refreshed` datetime DEFAULT NULL,
PRIMARY KEY (`url`,`topic_L1`,`topic_L2`,`topic_L3`,`topic_L4`,`topic_L5`),
UNIQUE KEY `inx_url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Dan Ramos
  • 1,092
  • 2
  • 19
  • 35
  • The indexes and data types are really bad. You should use shorter keys and use more concise data types. Also, you should `EXPLAIN` the `SELECT` part of the query. Even with 500k+ rows in each table, this query should not take more than a minute. – G-Nugget Apr 25 '13 at 04:24
  • Could you be a little more specific? What do you exactly mean by shortening the keys? Changing varchar(1024) to varchar(512) or something along those lines? I'm also not sure how I can make the datatypes more concise. Specifically speaking, the columns I'm doing joins on are all varchars because they can contain any kind of character/number/special character. I'm assuming a varchar would be the best option in this case? – Dan Ramos Apr 25 '13 at 04:34
  • Rule 1 of database theory: don't join on `VARCHAR`s. You should restructure the tables so the rows can be referenced by a unique `INT` id. This will reduced the length of the indexes used in the `JOIN`s from 333 bytes each to 4 bytes each, significantly reduce the size of the rows due to storing less redundant data, and significantly improve performance. I'd write an answer, but I don't have time for that now. – G-Nugget Apr 25 '13 at 04:41

1 Answers1

1

If you add the following indexes, your query will run faster:

CREATE INDEX runs_idx ON audience.item_indexes (runs_id);

ALTER TABLE audience.usercounts
DROP INDEX item_idx,
ADD INDEX item_idx (runs_id, item(333));

Also, item_indexes is utf8, but AT_categorization is latin1, which keeps any indexes from being used. To address this issue, change AT_categorization to utf8:

ALTER TABLE AT_categorization CHARSET=utf8;

Lastly, for the AT_categorization table, the two indexes

PRIMARY KEY (`url`,`topic_L1`,`topic_L2`,`topic_L3`,`topic_L4`,`topic_L5`),
UNIQUE KEY `inx_url` (`url`)

are redundant. So you could DROP these, and simply have the url field be the primary key:

ALTER TABLE AT_categorization 
DROP PRIMARY KEY,
DROP KEY `inx_url`,
ADD PRIMARY KEY (url);
Ross Smith II
  • 11,799
  • 1
  • 38
  • 43
  • I made the changes and now im trying to test it out step by step. But it doesn't appear to be using the indexes: `explain select ii.item FROM item_indexes ii force index (item) left join usercounts uc ON ii.item = uc.item` This gives me null for possible_keys – Dan Ramos Apr 25 '13 at 05:54
  • That's a different query. If you need *that* query to work, then don't drop the original `item_idx`. – Ross Smith II Apr 25 '13 at 06:05