
I am currently trying to wrap my head around the following task. Situation: we have one column containing time data (year-month-day hours:minutes). Our program gets the input (weekday, starttime, endtime, timeslot) and should return the interval of length timeslot that contains the fewest entries. For context, the table has several million rows. So our program would be specified as

def calculate_optimal_window(weekday, starttime, endtime, timeslot):
    return optimal_window

Example: suppose we want to input

weekday = Monday, starttime = 10:00, endtime = 12:00, timeslot = 30 minutes.

Here we want to count how many entries there are between 10:00 and 12:00, compute the number of values in every single 30-minute slot (i.e. 10:00-10:30, 10:01-10:31, etc.), and in the end return the slot with the fewest values. How would you go about formulating an efficient query?
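
To make that concrete, a naive in-memory version might look roughly like this (a sketch only; it assumes the timestamps were already filtered to the requested weekday and loaded into a list of datetime values, which is exactly the part I suspect won't scale to several million rows):

from datetime import timedelta

def calculate_optimal_window(timestamps, starttime, endtime, timeslot):
    # Sketch only: timestamps is a list of datetime objects already filtered
    # to the requested weekday; starttime/endtime are datetimes, timeslot is
    # the slot length in minutes.
    slot = timedelta(minutes=timeslot)
    best_start, best_count = None, None
    start = starttime
    while start + slot <= endtime:  # 10:00-10:30, 10:01-10:31, ...
        count = sum(1 for t in timestamps if start <= t < start + slot)
        if best_count is None or count < best_count:
            best_start, best_count = start, count
        start += timedelta(minutes=1)
    return best_start, best_start + slot  # the slot with the fewest entries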

Since I'm working with an Oracle SQL database, my second question is: would it be more efficient to work with libraries like Dask or Vaex to get the filtering and counting done? Where is the bottleneck in this situation?

Happy to provide more information if the formulation is too vague.

All the best.

kallikles

2 Answers


This part:

Since I'm working with an Oracle SQL database, my second question is: would it be more efficient to work with libraries like Dask or Vaex to get the filtering and counting done? Where is the bottleneck in this situation?

Depending on your server's specs and the cluster/machine you have available for Dask, it is rather likely that the bottleneck in your analysis would be the transfer of data between the SQL server and the Dask workers, even in the (likely) case that the transfer can be efficiently parallelised. From the DB's point of view, selecting the data and serialising it is likely at least as expensive as counting it in a relatively small number of time bins.

I would start by investigating how long the process takes with SQL alone, and whether this is acceptable, before moving the analysis to Dask. Usual rules would apply: having good indexing and sharding on the time index.
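
For instance, a quick way to get that baseline timing from Python (the python-oracledb connection details and the table/column names here are placeholders, not from your schema) would be something like:

import time
from datetime import datetime

import oracledb

# Per-minute counts over the requested range; adjust table/column names.
sql = """
    select trunc(the_time, 'MI') the_minute, count(*) the_count
    from test1
    where the_time between :start_ts and :end_ts
    group by trunc(the_time, 'MI')
"""

with oracledb.connect(user="user", password="password", dsn="host/service") as conn:
    with conn.cursor() as cur:
        t0 = time.perf_counter()
        cur.execute(sql,
                    start_ts=datetime(2021, 1, 25, 10, 0),
                    end_ts=datetime(2021, 1, 25, 12, 0))
        rows = cur.fetchall()
        print(f"{len(rows)} rows in {time.perf_counter() - t0:.3f}s")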

mdurant

You should at least do the basic filtering and counting in the SQL query. With a simple predicate, Oracle can decide whether to use an index or a partition and potentially reduce the database processing time. And sending fewer rows will significantly decrease the network overhead.

For example:

select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between timestamp '2021-01-25 10:00:00' and timestamp '2021-01-25 11:59:59'
group by trunc(the_time, 'MI')
order by the_minute desc;

(The trickiest part of these queries will probably be off-by-one issues. Do you really want "between 10:00 and 12:00", or do you want "between 10:00 and 11:59:59"?)
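
If you stop at that per-minute query, the sliding-window part can be finished in Python. A rough sketch (the fetch step and the variable names are assumptions on my part, not part of the query above):

from datetime import timedelta

def smallest_slot(rows, starttime, endtime, timeslot):
    # rows: (the_minute, the_count) pairs as returned by the query above,
    # e.g. via cursor.fetchall(); timeslot is the slot length in minutes.
    counts = dict(rows)
    slot = timedelta(minutes=timeslot)
    best_start, best_count = None, None
    start = starttime
    while start + slot <= endtime:
        total = sum(counts.get(start + timedelta(minutes=m), 0)
                    for m in range(timeslot))
        if best_count is None or total < best_count:
            best_start, best_count = start, total
        start += timedelta(minutes=1)
    return best_start, best_start + slot, best_count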

Optionally, you can perform the entire calculation in SQL. I would wager the SQL version will be slightly faster, again because of the network overhead. But sending one result row versus 120 aggregate rows probably won't make a noticeable difference unless this query is frequently executed.

At this point, this veers into the more subjective question of where to put the "business logic". I bet most programmers would prefer your Python solution to my query. But one minor advantage of doing all the work in SQL is keeping all of the weird date logic in one place: if you process the results in multiple steps, there are more chances for an off-by-one error.

--Time slots with the smallest number of rows.
--(There will be lots of ties because the data is so boring.)
with dates as
(
    --Enter literals or bind variables here:
    select
        cast(timestamp '2021-01-25 10:00:00' as date) begin_date,
        cast(timestamp '2021-01-25 11:59:59' as date) end_date,
        30 timeslot
    from dual
)
--Choose the rows with the smallest counts.
select begin_time, end_time, total_count
from
(
    --Rank the time slots per count.
    select begin_time, end_time, total_count,
        dense_rank() over (order by total_count) smallest_when_1
    from
    (
        --Counts per timeslot.
        select begin_time, end_time, sum(the_count) total_count
        from
        (
            --Counts per minute.
            select trunc(the_time, 'MI') the_minute, count(*) the_count
            from test1
            where the_time between (select begin_date from dates) and (select end_date from dates)
            group by trunc(the_time, 'MI')
            order by the_minute desc
        ) counts
        join
        (
            --Time ranges.
            select
                begin_date + ((level-1)/24/60) begin_time,
                begin_date + ((level-1)/24/60) + (timeslot/24/60) end_time
            from dates
            connect by level <=
            (
                --The number of different time ranges.
                select (end_date - begin_date) * 24 * 60 - timeslot + 1
                from dates
            )
        ) time_ranges
        on the_minute between begin_time and end_time
        group by begin_time, end_time
    )
)
where smallest_when_1 = 1
order by begin_time;

You can run a db<>fiddle here.

Jon Heller