12

I am building a query that performs some filtering on rating data.

Suppose I have a simple table called ratings like the following, storing data from an online rating tool:

+----------------+----------------+--------+
| page_title     | timestamp      | rating |
+----------------+----------------+--------+
| Abc            | 20110417092134 | 1      |
| Abc            | 20110418110831 | 2      |
| Def            | 20110417092205 | 3      |
+----------------+----------------+--------+

I need to extract pages with a high frequency of low values in the latest 10 ratings, and limit this query to pages that produced a volume of at least 20 ratings in the preceding week. This is the ridiculously long query I came up with:

SELECT a1.page_title, COUNT(*) AS rvol, AVG(a1.rating) AS theavg, 
(
     SELECT COUNT(*) FROM
     (
         SELECT * FROM ratings a2 WHERE a2.page_title = a1.page_title 
         AND DATE(timestamp) <= '2011-04-24' ORDER BY timestamp DESC LIMIT 10
     ) 
     AS latest WHERE rating >=1 AND rating <=2 ORDER BY timestamp DESC
)
AS lowest FROM ratings a1
WHERE DATE(a1.timestamp) <= "2011-04-24" AND DATE(a1.timestamp) >= "2011-04-17" 
GROUP BY a1.page_title HAVING COUNT(*) > 20

the top level query looks for pages with more than 20 ratings in the week terminating on 2011-04-24, the subquery is supposed to retrieve the number of ratings with values between [1,2] from the latest 10 ratings of each article from the top level query.

MySQL complains that a1.page_title in the WHERE clause of the subsubquery is an unknown column, I suspect this is because a1 is not defined as an alias in the second-level query, but only in the top-level query, but I am clueless how to fix this.

(edited)

I am adding as an explanation of my suspect above regarding cross-level referencing another query which works absolutely fine, note that here a1 is not defined in the subquery but it is in the immediate parent:

SELECT a1.page_title, COUNT(*) AS rvol, AVG(a1.rating) AS theavg, 
(
    SELECT COUNT(*) FROM ratings a2 WHERE DATE(timestamp) <= '2011-04-24'
    AND DATE(timestamp) >= '2011-04-17' AND rating >=1 
    AND rating <=2 AND a2.page_title = a1.page_title
) AS lowest FROM ratings a1 
WHERE DATE(a1.timestamp) <= '2011-04-17' AND DATE(a1.aa_timestamp) >= '2011-04-11' 
GROUP BY a1.page_title HAVING COUNT(*) > 20
radrat
  • 335
  • 1
  • 4
  • 11

3 Answers3

5

I think you might consider joining two in line views it might make things eaiser.

SELECT * 
FROM   (SELECT COUNT(*), 
               a2.page_title 
        FROM   ratings a2 
        WHERE  DATE(timestamp) <= '2011-04-24' 
               AND DATE(timestamp) >= '2011-04-17' 
               AND rating >= 1 
               AND rating <= 2 

        GROUP  BY a2.page_title) current 
       JOIN 
        (SELECT a1.page_title, 
                    COUNT(*)       AS rvol, 
                    AVG(a1.rating) AS theavg 
             FROM   ratings a1 
             WHERE  DATE(a1.timestamp) <= '2011-04-17' 
                    AND DATE(a1.a_timestamp) >= '2011-04-11' 
             GROUP  BY a1.page_title 
             HAVING COUNT(*) > 20) morethan20 
         ON current .page_title = morethan20.page_title 
Conrad Frix
  • 51,984
  • 12
  • 96
  • 155
  • Hi Conrad, this one doesn't work as a1 in the first subquery is not defined in the parent query (SQL fails with error: Unknown column 'a1.page_title' in 'where clause') – radrat Apr 25 '11 at 23:57
  • @radrat yep you're right sorry about that. I've updated the answer. Basically the join happens out side the context of the subqueries – Conrad Frix Apr 26 '11 at 05:55
  • thanks for the update – with a few changes your solution pointed me in the right direction. – radrat Apr 27 '11 at 16:58
1

If all you have is this one simple table, I have no idea where you are pulling all of these other table names from, such as: a1, a2, ratings. I feel like either your SQL is quite a bit off, or your leaving out information.

The reason your having the error you do is because in your sub-sub-query you do not include a1 in your "FROM" statement... as so that table is not included, it cannot be referenced in your WHERE clause in that sub query.

SELECT * 
FROM
    (SELECT *
        FROM a1
        WHERE a1.timestamp <= (NOW()-604800)
            AND a1.timestamp >= (NOW()-1209600)
        GROUP BY a1.page_title
        HAVING COUNT(a1.page_title)>20)
    AS priorWeekCount
WHERE
    rating <= 2
ORDER BY timestamp DESC
LIMIT 10

as I dont have a full table to test this... I THINK this is what your looking for.. but it is untested, and knowing my coding habits, very rarely is what I type 100% perfect first time ;)

CenterOrbit
  • 6,446
  • 1
  • 28
  • 34
  • welcome to stack overflow! this strikes me as more of a comment, since it doesn't actually answer the question. – colinmarc Apr 25 '11 at 19:47
  • "a1" and "a2" are table aliases required to cross-reference fields across queries of different levels (see for instance [this example](http://stackoverflow.com/questions/1973246/how-to-specify-the-parent-query-field-from-within-a-subquery-in-mysql). "ratings" is the name of the main table, I omitted it because I thought it was trivial, sorry for the confusion. "latest" and "lowest" are aliases needed because of the nested SELECTs. – radrat Apr 25 '11 at 20:00
  • Hi colinmarc, thanks for the welcome!... I think because I have low points and just starting on stack overflow, it will not let me leave comments on the main questions. As you are right, this is more of a comment... but I am working on the SQL and will be done shortly. – CenterOrbit Apr 25 '11 at 20:00
  • I understand aliases, and cross-referencing.. but the problem with your SQL currently is that you never give the name of "a1" and "a2" as an alias to any of your sub-queries, you have names given such as: theavg, latest, and lowest... no a1 or a2 ever mentioned – CenterOrbit Apr 25 '11 at 20:03
  • Not including a definition for a1 in the subsubquery is not the cause of the problem: this is actually what you do when you refer to a field in a parent query (see the link I posted above). My concern is that this cross-level reference may only work for the immediate parent query, whereas in my SQL above I refer to a query 2 levels up. – radrat Apr 25 '11 at 20:04
  • I did look at the link you posted but I have never done an SQL statement this way. see my edited SQL above and let me know if this is more of on the right track. "a1" is the name of the table in which you want to pull your data from... since I didnt know what the name of your example table was – CenterOrbit Apr 25 '11 at 20:11
  • Andrew, what I am trying to achieve (and what both Conrad and Andrew above are trying to help with) is a case of correlated subqueries, see: http://dev.mysql.com/doc/refman/5.5/en/correlated-subqueries.html Your solution actually uses a subquery that doesn't produce any useful data to which the WHERE statement of the outer query can be applied. If you group by page_title in the subquery you cannot then filter by rating value in the outer query. – radrat Apr 26 '11 at 00:01
1

Your analysis of the error is correct: lowest is known in the subquery, a1 is not.

I think the logic is inside-out. The following probably isn't the best but the optimizer may be smart enough to combine the two subqueries in the outermost SELECT. (If it isn't, at the risk of readability you can introduce another level of subquery.)

SELECT r20plus.page_title,
 AVG((SELECT rating 
      FROM ratings r WHERE r.page_title=r20plus.page_title 
      ORDER BY timestamp DESC LIMIT 10) ) as av,
 SUM((SELECT CASE WHEN rating BETWEEN 1 AND 2 THEN 1 ELSE 0 END 
      FROM ratings r WHERE r.page_title=r20plus.page_title
      ORDER BY timestamp DESC LIMIT 10) ) as n_low,
FROM
(SELECT page_title FROM ratings  
WHERE DATE(a1.timestamp) <= "2011-04-24" AND DATE(a1.timestamp) >= "2011-04-17"
GROUP BY page_title
HAVING COUNT(rating) >= 20) AS r20plus;
Andrew Lazarus
  • 18,205
  • 3
  • 35
  • 53
  • Hi Andrew, it looks like this solution won't work: the subqueries embedded in AVG and SUM both return more than 1 row and you cannot SELECT sums or averages over a series of results returned by a subquery (the only valid use of these functions is in a query with an explicit GROUP BY clause). Or am I missing something? – radrat Apr 25 '11 at 23:54
  • I'm not familiar with MySQL limitations (use Postgres), but in the absence of a GROUP BY the aggregate (SUM;AVG) *should* be taken over all of the returned rows-> only one row in the answer. (`GROUP BY NULL` should also be legal.) It would also be possible to do this with windowing functions, but I know MySQL lacks them. – Andrew Lazarus Apr 26 '11 at 00:22