2

I'm a complete newbie in sql (and any kind of coding), but I'm trying to write a basic query that would return, by country, the number of users who logged in for the first time after august 15th, and the number of users who came back on the day following their first session.

I'm working with a table called events that includes the following columns: utc_timestamp, name, id and a json string that includes several parameters to this event (as you can see, i'm using it to retrieve the session number and the country)

When I'm running this query, it says "Line 5: cannot recognize input near 'SELECT' 'DISTINCT' 'id' in function specification". I tried putting bracket between select and DISTINCT, i still keep getting the same error message. Any idea on what is causing it?

Thanks for your help guys

SELECT 
get_json_object(json, '$. User_Country ') AS country
, COUNT(DISTINCT id) AS Users
, COUNT(
    SELECT DISTINCT id 
    FROM events 
    WHERE EXISTS(
        SELECT * 
        FROM events
        WHERE name = "Logged_in" 
        AND utc_timestamp>(
            (
            SELECT utc_timestamp
            FROM events
            WHERE month = 201408
            AND name = "Logged_in" 
            AND get_json_object(json, '$. Session_nb ') = 0
            AND utc_timestamp > UNIXTIMESTAMP('2014-08-15 12:00:00')
            ) + INTERVAL '1 day'
        )
    )
) AS Retained1
FROM events 
WHERE month = 201408 
AND name = "Logged_in" 
AND get_json_object(json, '$. Session_nb ') = 0
AND utc_timestamp > UNIXTIMESTAMP('2014-08-15 12:00:00')
GROUP BY (get_json_object(json, '$. User_Country ')) 
ORDER BY (get_json_object(json, '$. User_Country '))
QuiZzer
  • 23
  • 3
  • I like your definition of a 'basic query' :-) What DBMS are you using: MySQL, MS SQL Server, PostgreSQL, ... ? You can edit the question and add a tag. – Josien Sep 12 '14 at 09:01
  • Yes sorry I just had to check the tool faq to see that it was actually hiveql. – QuiZzer Sep 12 '14 at 09:32
  • You may have a better performance if you select an indexed column in the `EXISTS` subquery. Something like `SELECT id FROM events` instead of select all the columns (`SELECT * FROM events`) as you're doing. In MS SQL Server I'd do `SELECT '1' FROM events` - I don't know if your RDBMS supports it. – Caffé Sep 12 '14 at 12:33
  • Good to know, I'll do that, thanks. – QuiZzer Sep 12 '14 at 12:49

1 Answers1

1

Don't count the subquery result. Instead, use COUNT in the subquery itself.

SELECT 
get_json_object(json, '$. User_Country ') AS country
, COUNT(DISTINCT id) AS Users
, (
    SELECT COUNT(DISTINCT id)
    FROM events 
    WHERE EXISTS(
        SELECT * 
        FROM events
        WHERE name = "Logged_in" 
        AND utc_timestamp>(
            (
            SELECT utc_timestamp
            FROM events
            WHERE month = 201408
            AND name = "Logged_in" 
            AND get_json_object(json, '$. Session_nb ') = 0
            AND utc_timestamp > UNIXTIMESTAMP('2014-08-15 12:00:00')
            ) + INTERVAL '1 day'
        )
    )
) AS Retained1
FROM events 
WHERE month = 201408 
AND name = "Logged_in" 
AND get_json_object(json, '$. Session_nb ') = 0
AND utc_timestamp > UNIXTIMESTAMP('2014-08-15 12:00:00')
GROUP BY (get_json_object(json, '$. User_Country ')) 
ORDER BY (get_json_object(json, '$. User_Country '))
Caffé
  • 1,161
  • 1
  • 9
  • 19
  • Thanks but I still get a similar error message: "Line 5: cannot recognize input near 'SELECT' 'COUNT' '(' in expression specification" But apparently I've got other things to work on since i just realized that the DBMS does not support the INTERVAL command :( – QuiZzer Sep 12 '14 at 12:48