1

I have a table which is having filenames and a number of timestamp ranges for each file for eg as below:

       filename           mindatetime                 maxdatetime
    monitor_4.dat   2019-04-28 09:00:00 AM      2019-04-29 11:00:00 AM
    monitor_4.dat   2019-04-28 11:00:00 AM      2019-04-29 18:00:00 PM
    monitor_4.dat   2019-04-28 09:30:00 AM      2019-04-29 23:00:00 PM
    monitor_5.dat   2019-04-28 07:00:00 AM      2019-04-28 10:00:00 AM
    monitor_5.dat   2019-04-28 02:00:00 PM      2019-04-28 06:00:00 PM
    monitor_5.dat   2019-04-28 09:00:00 AM      2019-04-28 03:00:00 PM
    monitor_7.dat   2019-04-21 03:06:26.0 AM    2019-05-21 03:06:10.0 AM
    monitor_7.dat   2019-05-21 03:06:10.001 AM  2019-05-24 03:06:11.0 AM
    monitor_7.dat   2019-06-05 03:06:18.001 AM  2019-06-06 03:06:11.0 AM
    monitor_7.dat   2019-05-24 03:06:11.001 AM  2019-06-05 03:06:18.0 AM
    monitor_7.dat   2019-05-12 07:00:10.001 AM  2019-05-13 10:00:10.000 AM
    monitor_7.dat   2019-05-15 09:30:10.001 AM  2019-05-18 11:30:10.000 AM

I have a requirement to remove the redundant timestamp ranges ie> the ones which fall under a given timestamp range. In this case for the file "monitor_5.dat" we need the mindatetime as 7AM and maxdatetime as 6PM as they form the logical min and max timestamp ranges which would cover the other entries.

So my resultant output should look like:

   filename           mindatetime                 maxdatetime
monitor_4.dat   2019-04-28 09:00:00 AM      2019-04-29 23:00:00 PM
monitor_5.dat   2019-04-28 07:00:00 AM      2019-04-28 06:00:00 PM
monitor_7.dat   2019-04-21 03:06:26.0 AM    2019-06-05 03:06:18.0 AM
monitor_7.dat   2019-05-12 07:00:10.001 AM  2019-05-13 10:00:10.000 AM
monitor_7.dat   2019-05-15 09:30:10.001 AM  2019-05-18 11:30:10.000 AM

This needs to be achieved through SQL only. Any suggestions would be helpful. I have already explored the "tsrange" function provided in Postgres but that is not helping me completely.

Abhilash
  • 803
  • 1
  • 9
  • 32

2 Answers2

1

This answer is for PostgreSQL only; did you tag MySQL deliberately?

You'd have to join the table with itself and delete the entries contained in other entries:

DELETE FROM mytable AS a
USING mytable AS b
WHERE a.filename = b.filename
  AND tsrange(a.mindatetime, a.maxdatetime) <@ tsrange(b.mindatetime, b.maxdatetime)
  AND (a.ctid, a.xmin::text) <> (b.ctid, b.xmin::text);

The last condition prevents that a row is compared to itself.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
1

This is a form of gaps-and-islands problem. I would recommend finding where the "overlaps" occur and then using a cumulative sum to identify the groups. A cumulative max() works for this:

select filename, min(mindatetime) as mindatetime,
       max(maxdatetime) as maxdatetime
from (select t.*,
             sum(case when prev_maxdatetime >= mindatetime then 0 else 1 end) over
                 (partition by filename order by mindatetime) as grp
      from (select t.*,
                   max(maxdatetime) over
                       (partition by filename
                        order by mindatetime
                        rows between unbounded preceding and 1 preceding
                       ) as prev_maxdatetime
            from t
           ) t
     ) t
group by filename, grp;

The innermost subquery determines where a non-overlapping time range starts. The middle query then does a cumulative sum of these "starts" to assign a grouping identifier to each group. The outer query then aggregates by this group (and the filename).

You can run the inner queries and see what values they produce.

Here is a db<>fiddle. As far as I'm concerned, this is working correctly according to the question that you have asked here. If you have a time lag where you want to consider the values the same/overlapping, then ask a new question with a clear explanation of your logic.

EDIT:

If you want a single record per filename, then just use aggregation:

select filename, min(mindatetime) as mindatetime,
       max(maxdatetime) as maxdatetime
from t
group by filename;

The first version combines rows that have no gaps between them. This just takes the earliest and latest date/times.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • This logic works for me. Can you please explain in brief what this query is trying to do – Abhilash Jun 17 '19 at 12:09
  • Its failing for filename "monitor_7.dat" – Abhilash Jun 18 '19 at 05:09
  • @Abhilash28Abhi . . . First "failing" is not very descriptive of what is happening, so there is no way to help. The issue is, no doubt, due to the milliseconds on the timestamps. Be sure that your timestamp represents them correctly. – Gordon Linoff Jun 18 '19 at 10:25
  • What i meant was its not giving me a single output for that case instead it returns all the rows for that systemId. The reason I added this now was because of huge records in the table and its difficult for me to add all the scenarios at once. – Abhilash Jun 18 '19 at 11:19
  • Ok..it was not that straight forward. I have included the records which are not continuous for which the min and max logic will not work out – Abhilash Jun 18 '19 at 11:48