Query Database Accurately Based on Timestamp

Question

I am currently having an accuracy issue when querying price vs. time in a Google Big Query Dataset. What I would like is the price of an asset every five minutes, yet there are some assets that have an empty row for an exact minute.

For example, with VEN vs ICX which are two cryptocurrencies, there might be a time at which price data is not available for a specific second. In my query, I am querying a database for every 300 seconds and taking the price data, yet some assets don't have a timestamp for 5 minutes and 0 seconds. Thus, I would like the get the last known price: a good price to use would be 4 minutes and 58 seconds.

My query right now is:

SELECT MIN(price) AS PRICE, timestamp
FROM [coin_data] 
WHERE coin="BTCUSD" AND TIMESTAMP_TO_SEC(timestamp) % 300 = 0
GROUP BY timestamp
ORDER BY timestamp ASC

This query results in this sort of gap in specific places:

Row((10339.25, datetime.datetime(2018, 2, 26, 21, 55, tzinfo=<UTC>)))

Row((10354.62, datetime.datetime(2018, 2, 26, 22, 0, tzinfo=<UTC>)))

Row((10320.0, datetime.datetime(2018, 2, 26, 22, 10[should be 5 for 5 min], tzinfo=<UTC>)))

This one should not be 10 in the last column as that is the minutes place and it should read 5 mins.

A.Queue · Answer 1 · 2018-03-22T09:12:51.463

In order to select a row that has a 5 minute mark/timestamp if it exists, or the closest existing entry, you can use "(analytic) window functions"(uses OVER()) instead of aggregate functions(uses GROUP BY), as following:

group all rows into "separate" 5 minute groups
sort them by proximity to the desired time
select the first row from each partition.

Here I am using OVER clause to create the "window frames" and sorts the rows in them. Then RANK() numbers all rows in each window frame as they are sorted.

Standard SQL

WITH
  data AS (
  SELECT *,
    CAST(FLOOR(UNIX_SECONDS(timestamp)/300) AS INT64) AS timegroup
  FROM
    `coin_data` )
SELECT min(price) as min_price, timestamp
FROM
 (SELECT *, RANK() OVER(PARTITION BY timegroup ORDER BY timestamp ASC) AS rank
  FROM data)
WHERE rank = 1
group by timestamp
ORDER BY timestamp ASC

Legacy SQL

SELECT MIN(price) AS min_price, timestamp
FROM (
  SELECT *,
RANK() OVER(PARTITION BY timegroup ORDER BY timestamp ASC) AS rank,
  FROM (
    SELECT *,
      INTEGER(FLOOR(TIMESTAMP_TO_SEC(timestamp)/300)) AS timegroup
    FROM [coin_data]) AS data )
WHERE rank = 1
GROUP BY timestamp
ORDER BY timestamp ASC

It seems that you have many prices for the same time stamp in which case you may want to add another field to OVER clause.

OVER(PARTITION BY timegroup, exchange ORDER BY timestamp ASC)

Notes:

Consider migrating to Standard SQL, which is the preferred SQL dialect for querying data stored in BigQuery. You can do that on single query basis, so you don't have to migrate everything at the same time.
My idea was to provide a general query that would illustrate the principle so I don't filter for empty rows, because it's not clear if they are null or empty string and it's not really necessary for the answer.

Query Database Accurately Based on Timestamp

1 Answers1

Standard SQL

Legacy SQL

Notes: