5

So essentially I have two tables, containing URLS and TAGS, with a has-and-belongs-to-many relationship between the two via a joins tables TAGS_URLS.

A simple query to find URL's by tags would be:

SELECT urls.id FROM urls 
  INNER JOIN tags_urls ON urls.id=tags_urls.url_id
  INNER JOIN tags ON tags_urls.tag_id=tags.id 
WHERE tags.tag IN ("sample","tag","list");

However, I'm trying to recover an intersection of all URL's that contain all of a set of tags. I.e., only URL's that contain the tag "sample" AND "tag" AND "list".

I have a working query, but I cannot get the query to execute in less than 30 seconds.

SELECT a.id
  FROM
    (SELECT DISTINCT urls.id FROM urls
      INNER JOIN tags_urls ON tags_urls.url_id=urls.id INNER JOIN tags ON tags.id=tags_urls.tag_id
      WHERE tags.tag = 'sample') a
  JOIN
     (SELECT DISTINCT urls.id FROM urls
      INNER JOIN tags_urls ON tags_urls.url_id=urls.id INNER JOIN tags ON tags.id=tags_urls.tag_id
      WHERE tags.tag = 'list') b
  ON a.id = b.id;

The result set is correct, but the performance is horrific.

I do also currently have the data duplicated in a Redis database as a list of URL id's stored in tag sets so I can do something like this and get a result set VERY quickly.

SINTER "tag-sample" "tag-list"

Would it be possible, with reasonable effort, to bring the MySQL performance for this task up to the Redis levels with SINTER?

Steve C.
  • 73
  • 6
  • Have you already checked your indexes? Have you done an EXPLAIN on the SELECT statement to see if you are using the indexes you think you are? – Brad Apr 15 '11 at 22:07
  • Everything is indexed, and EXPLAIN shows it using the proper indexes for the PRIMARY keys (id), both fields in the tags_urls join table are indexed ID's and the tag names are indexed as well. Another note is - if I execute the two statements to pull the list of ID's by tag, they take less than 5ms each, individually. – Steve C. Apr 15 '11 at 22:54
  • How large is the data? I believe what the RDBMS is doing under the hood is creating two temp tables (one each for the subselects) and then doing the join between those two temp tables. Could you post the output of the EXPLAIN? If it is copying to a temp table, removing the need for the temp table should give you much better performance. – Brad Apr 15 '11 at 22:59
  • Yes it is definitely using temporary tables. There are approx 600k Url's and 40k Tags and 5.2million rows in the tags_urls table. – Steve C. Apr 15 '11 at 23:02
  • LOL, okay. Then take the 'I think' part out of my answer below... That is def your problem then. – Brad Apr 15 '11 at 23:10

2 Answers2

1

I am not 100% sure, but I think the underlying engine is creating a temp table for each of your subselects. Depending on the size of your data, this can be quite costly. If they are big (and they are in your case) temp tables have to write their contents out to disk because they are too big to hold in memory at once. So basically your query is copying huge amounts of data as it tries to build out two temporary tables that match the selection criteria for your two subselects. Once this is done, it finally executes the outer select and this most likely rather fast.

I would try to factor the subselects out for inner joins. I think the following will give you what you are looking for:

select urls.id from urls
inner join tags_urls tu1 on tu1.url_id = urls.id
inner join tags t1 on tu1.tag_id = t1.id and t1.tag = 'sample'
inner join tag_urls tu2 on tu2.url_id = urls.id
inner join tags t2 on t2.id = tu2.tag_id and t2.tag = 'list'

You would continue to add pairs of inner joins to tag_urls and tags for each 'tag' you wanted to intersect with. Again, run this through explain and make sure everything has the right index.

DBMS's can do pretty well with a several inner joins but as you increase the number of intersections, your performance will decrease.

Brad
  • 5,428
  • 1
  • 33
  • 56
  • A definite improvement, down to 6.02sec from 30. 6 seconds is still higher than I'm comfortable with for a web-app. Would re-arranging the data help any? I can't imagine a 5.2m row joins table helping performance all that much. – Steve C. Apr 15 '11 at 23:10
  • Paste in the create table statements for the three tables. Make sure they include the indexes. – Brad Apr 15 '11 at 23:13
  • I think I may split out the query into two parts, fetching the ID's of the tags first and using that in the query gets it down to 0.15sec. – Steve C. Apr 15 '11 at 23:14
  • Beyond the scope of this question, you could try to pre-compute the intersections ahead of time. This way your live web traffic reads from a cache while the slower queries run in the background every so often to update the cache. You should be able to squeeze some more perf out of this though before you need that. – Brad Apr 15 '11 at 23:15
  • @Steve, that is a typical thing to do. It does sound a little bit though like you may be missing an index across the joins. MySQL will not combine indexes. So your index needs to be on the PK/FK + the contents of the tag (e.g., one index on the tags table needs to be PK + tag, in that order). – Brad Apr 15 '11 at 23:19
  • You should expect subsecond queries with table sizes such as yours, given the right indexes. – dkretz Apr 16 '11 at 01:04
0

You might try replacing the sql subqueries in the 2nd statmenet with joins. Robert Vieira claims in his Sql Server books that sometimes joins are faster and sometimes subqueries are faster. Hard to believe that the same would not be true for MySql as well. Also, if there ia quite a bit of other data in the table, other than 'list' or 'sample', then you might want to insert this data into a temp table and run your queries from that table instead. This is especially true if you are going to be running multiple queries on that data.

Bob Bryan
  • 3,687
  • 1
  • 32
  • 45