-1

I'm trying to implement a nested query to pull no more than one sample per log, and I think I know how to implement its components separately:

  1. Query a set of logs that contain data relevant to my analysis:
    SELECT  
      runs.object_type as object_type,  
      runs.name as log_name,  
    from project.runs.latest_runs  
    WHERE object_type = "ROCKET"  
    group by object_type, log_name

This results in a list of log names, e.g. "log_name_2021_09_01", "log_name_2021_09_03" etc.

  1. Query no more than one event with a specific condition from a single known log:
    SELECT  
      object.path_meters as pos,  
      object_speed as speed,  
      log.run as run_name,  
    FROM project.events.last30days  
    WHERE log.run = "log_name_2021_10_01"  
      AND object.speed > 0.0  
    LIMIT 1  

The above query returns no more than one sample for the specified log.

How can I combine these queries to pull samples from a set of logs that is returned by Query 1, and at the same time there should be no more than one sample per log?

Update:
Let's say a DB contains three logs:

  1. log_name_2021_09_01. The associated object_type to the log is ROCKET. The log contains 100k data samples: 90k of them have object.speed = 0.0, 10k of them have speed > 0.0.
  2. log_name_2021_09_02. The associated object_type to the log is CAR. The log also contains 100k samples with a similar proportion to log 1.
  3. log_name_2021_09_03. The associated object_type to the log is ROCKET. The log also contains 100k samples with a similar proportion to log 1.

I'm only interested in logs with the object type ROCKET. Two logs correspond to this condition: log_name_2021_09_01 and log_name_2021_09_03. These log names can be obtained by query 1 depicted above. I'd like to pull only one sample point (with speed > 0) from each of the two logs. That is, in the end I'd like to have a query that returns two samples: one from log_name_2021_09_01 and one from log_name_2021_09_03.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
vadbut
  • 39
  • 6
  • What DBMS are you using? Double quotes in SQL are for names, not for strings, so this should be `WHERE object_type = 'ROCKET'` instead. And if you want a unique list of log names then `select distinct name from runs where object_type = 'ROCKET'`. And you don't care which event you pick per log? It can be the same every time you run the query or be different every time; this doesn't matter? – Thorsten Kettner Oct 01 '21 at 17:25
  • 1) I use Dremel. The double quote is the correct syntax for my DB, please leave it aside. 2) What do you mean by distinct name logs? They are distinct by definition - query 1 is guaranteed to return a list of unique log names. 3) it doesn't matter which sample is picked in a log. I just need any sample corresponding to object.speed >0.0 condition – vadbut Oct 01 '21 at 17:41
  • Could you show sample of your desired result? – Bijan Ghasemi Oct 01 '21 at 18:02
  • I added an example to the original post. Hope it makes things clearer. – vadbut Oct 01 '21 at 18:13
  • Sorry, I cannot help here. I read that Dremel has an SQL like language. I know SQL, but I don't know the Dremel query language. I've added a `dremel` tag to your request, so others may jump in. As to distinct log names: you are using a `GROUP BY` query for this. In SQL you'd rather use `SELECT DISTINCT`. – Thorsten Kettner Oct 01 '21 at 21:00

1 Answers1

1

You question omits actual example data, so we're forced to infer much from your description. It is strongly recommended that your questions include sample data and the results you'd want from that sample data. This allows us both a concrete example to base our understanding on, and provides a test-set for us to use when developing an answer.

  • Would you trust any code you've written without having run it against test data?

That said, the following should be something like what you're looking for... (For each log, it only selects the one row with the highest pos.)

WITH
  rocket_logs AS
(
  SELECT DISTINCT
    runs.object_type AS object_type,
    runs.name        AS log_name
  FROM
    project.runs.latest_runs  
  WHERE
    object_type = "ROCKET"  
),
  sorted_logs AS
(
  SELECT  
    log.run              AS run_name,
    object.path_meters   AS pos,
    object_speed         AS speed,
    ROW_NUMBER()
      OVER (
        PARTITION BY log.run
            ORDER BY object.path_meters DESC
      )
                         AS seq_num
  FROM
    project.events.last30days  
  WHERE
    object.speed > 0.0
)
SELECT
  *
FROM
  rocket_logs   r
INNER JOIN
  sorted_logs   s
    ON s.run_name = r.log_name
WHERE
  s.seq_num = 1

For a more exact answer, please give:

  • example data, for both tables
  • example results for that data
  • both being sufficient to demonstrate all necessary behaviours
MatBailie
  • 83,401
  • 18
  • 103
  • 137
  • I provided an example in the original post. Your code snippet appears to be working, even though it doesn't seem to be optimal: my mental model tells me that it crawls through all sample points in logs to aggregate sorted_logs (even though only one should be kept). I wasn't expecting it to be so complex in SQL to query a list and traverse through it with smth like a while-loop to use its entries in another query. Seems odd that SQL doesn't allow smth like that easily. – vadbut Oct 01 '21 at 21:44
  • @vadbut Your mental model is flawed. SQL isn't an imperative language, it's a declarative language. From the expressed problem the DBMS generates an execution plan. Predicate pushdown, macro like expansion, and other techniques mean this is likely Much more efficient than you give credit. Equally, SQL isn't a linear language, it's a set based language, thinking in while loops and list traversal is simply incorrect. I recommend testing the solutions, and learning the techniques, rather than judging them based on faulty premises. Discard your mental model and research execution plans. – MatBailie Oct 01 '21 at 21:54
  • Also, you didn't provide an example. An example includes concrete data, not notional descriptions. Actual table contents, expressing the corner cases that demonstrate both what should and should not happen. If there isn't enough detail in the post to build my own replica to test on, the question is incomplete. https://stackoverflow.com/help/minimal-reproducible-example – MatBailie Oct 01 '21 at 21:59