MySQL complex subquery formulation

Question

I have two tables - books and images. books has columns like id, name, releasedate, purchasecount. images has bookid (which is same as the id in books, basically one book can have multiple images. Although I haven't set any foreign key constraint), bucketid, poster (each record points to an image file in a certain bucket, for a certain bookid).

Table schema:

poster is unique in images, hence it is a primary key.
Covering index on books: (name, id, releasedate)
Covering index on images: (bookid,poster,bucketid)

My query is, given a name, find the top ten books (sorted by number of purchasecount) from the books table whose name matches that name, and for that book, return any (preferably the first) record (bucketid and poster) from the images table.

Obviously this can be solved by two queries by running the first, and using its results to query the images table, but that will be slow, so I want to use 'join' and subquery to do it in one go. However, what I am trying is not giving me correct results:

select books.id,books.name,year(releasedate),purchasecount,bucketid,poster from books 
inner join (select bucketid,bookid, poster from images) t on 
t.bookid  = books.id  where name like "%foo%" order by purchasecount desc limit 2;

Can anybody suggest an optimal query to fetch the result set as desired here (including any suggestion to change the table schema to improve search time) ?

Updated fiddle: http://sqlfiddle.com/#!9/17c5a8/1.

The example query should return two results - fooe and fool, and one (any of the multiple posters corresponding to each book) poster for each result. However I am not getting correct results. Expected:

fooe - 1973 - 459 - 11 - swt (or fooe - 1973 - 459 - 11 - pqr)

fool - 1963 - 456 - 12 - xxx (or fool - 1963 - 456 - 111 - qwe)

Image file name. All images lie in the same bucket hence names have to be unique. — SexyBeast, Jun 08 '16 at 22:56
Even with the best will in the world, a filename can change. This is not a good PRIMARY KEY — Strawberry, Jun 08 '16 at 22:57
The filenames are generated by a 3rd party API and are guaranteed to be unique. Plus before inserting into the bucket, additional check is done to see if such a file exists by that name, in which case generate another name (and go on) until a unique one is generated, and then dumped into the bucket. We may need to query by the image name as well (like how many "likes" this image has got), hence primary key, instead of making another unique id for it, since it is anyhow guaranteed to be unique. — SexyBeast, Jun 08 '16 at 23:00
We love questions with schemas in text and sample data. A sqlfiddle is wonderful. Otherwise we often flee. You are asking someone to pound that out when you already have it. — Drew, Jun 08 '16 at 23:10
Your expected results don't match the data in the fiddle as described. Also how many books do you want returned. The question is vague. Also, you can say equals the name but then in your query say like %God% ... two totally different things. What I have found is that queries are not complex. What is is writing a good question. — Drew, Jun 09 '16 at 02:21
Can you tell me where it is vague? I have clearly mentioned `from the books table whose name matches that name` in the question. And I edited my question to provide the expected data, and also edited the fiddle. Let me put it forward in another way. First target, give the top ten books that match the query based on `purchasecount`. For each book, provide any image (`bucketid` and `poster`) from the multiple ones possible. — SexyBeast, Jun 09 '16 at 07:20
I can re-read it as you modified the sqlfiddle in the last 4 to 5 hours. I can be reached in [Campaigns](http://chat.stackoverflow.com/rooms/95290) to discuss further. Just ping me there with an @drew msg so my keyboard beeps at me. Also, learn to @ people in messages so they know to come back to talk to you under posts. — Drew, Jun 09 '16 at 11:48

score 1 · Accepted Answer · answered Jun 09 '16 at 14:08

I agree with Strawberry about the schema. We can discuss ideas for better performance and all that. But here is my take on how to solve this after a few chats and changes to the question.

Note below the data changes to deal with various boundary conditions which include books with no images in that table, and tie-breaks. Tie-breaks meaning using the max(upvotes). The OP changed the question a few times and added a new column in the images table.

Modified quetion became return 1 row make per book. Scratch that, always 1 row per book even if there are no images. The image info to return would be the one with max upvotes.

Books table

create table books 
(   id int primary key, 
    name varchar(1000), 
    releasedate date, 
    purchasecount int
) ENGINE=InnoDB;

insert into books values(1,"fool","1963-12-18",456);
insert into books values(2,"foo","1933-12-18",11);
insert into books values(3,"fooherty","1943-12-18",77);
insert into books values(4,"eoo","1953-12-18",678);
insert into books values(5,"fooe","1973-12-18",459);
insert into books values(6,"qoo","1983-12-18",500);

Data Changes from original question.

Mainly the new upvotes column.

The below includes a tie-break row added.

create table images 
(   bookid int, 
    poster varchar(150) primary key, 
    bucketid int, 
    upvotes int -- a new column introduced by OP
) ENGINE=InnoDB;

insert into images values (1,"xxx",12,27);
insert into images values (5,"pqr",11,0);
insert into images values (5,"swt",11,100);
insert into images values (2,"yyy",77,65);
insert into images values (1,"qwe",111,69);
insert into images values (1,"blah_blah_tie_break",111,69);
insert into images values (3,"qwqqe",14,81);
insert into images values (1,"qqawe",8,45);
insert into images values (2,"z",81,79);

Visualization of a Derived Table

This is just to assist in visualizing an inner piece of the final query. It demonstrates the gotcha for tie-break situations, thus the rownum variable. That variable is reset to 1 each time the bookid changes otherwise it increments. In the end (our final query) we only want rownum=1 rows so that max 1 row is returned per book (if any).

Final Query

select b.id,b.purchasecount,xDerivedImages2.poster,xDerivedImages2.bucketid
from books b
left join
(   select i.bookid,i.poster,i.bucketid,i.upvotes,
    @rn := if(@lastbookid = i.bookid, @rn + 1, 1) as rownum,
    @lastbookid := i.bookid as dummy
    from 
    (   select bookid,max(upvotes) as maxup
        from images
        group by bookid
    ) xDerivedImages
    join images i
    on i.bookid=xDerivedImages.bookid and i.upvotes=xDerivedImages.maxup
    cross join (select @rn:=0,@lastbookid:=-1) params
    order by i.bookid
) xDerivedImages2
on xDerivedImages2.bookid=b.id and xDerivedImages2.rownum=1
order by b.purchasecount desc
limit 10

Results

+----+---------------+---------------------+----------+
| id | purchasecount | poster              | bucketid |
+----+---------------+---------------------+----------+
|  4 |           678 | NULL                |     NULL |
|  6 |           500 | NULL                |     NULL |
|  5 |           459 | swt                 |       11 |
|  1 |           456 | blah_blah_tie_break |      111 |
|  3 |            77 | qwqqe               |       14 |
|  2 |            11 | z                   |       81 |
+----+---------------+---------------------+----------+

The significance of the cross join is merely to introduce and set starting values for 2 variables. That is all.

The results are the top ten books in descending order of purchasecount with the info from images if it exists (otherwise NULL) for the most upvoted image. The image selected honors tie-break rules picking the first one as mentioned above in the Visualization section with rownum.

Final Thoughts

I leave it to the OP to wedge in the appropriate where clause at the end as the sample data given had no useful book name to search on. That part is trivial. Oh, and do something about the schema for the large width of your primary keys. But that is off-topic at the moment.

Good lord, that has to be the most complicated query I have ever seen. Will it be significantly faster than querying the books table first to fetch the matching records from `books` table, then have the client fetch the most upvoted poster for each book returned in a separate query? — SexyBeast, Jun 09 '16 at 14:26
That query is like a 2 or 3 on a scale of 1 to 10. Alright maybe a 3 or 4. There are people here way smarter than me. — Drew, Jun 09 '16 at 14:36