0

Not being an SQL expert, I am struggling with the following:

I inherited a larg-ish table (about 100 million rows) containing time-stamped events that represent stage transitions of mostly shortlived phenomena. The events are unfortunately recorded in a somewhat strange way, with the table looking as follows:

phen_ID   record_time  producer_id   consumer_id  state   ...

000123    10198789                               start
          10298776     000123        000112      hjhkk
000124    10477886                               start
          10577876     000124        000123      iuiii
000124    10876555                               end

Each phenomenon (phen-ID) has a start event and theoretically an end event, although it might not have been occured yet and thus not recorded. Each phenomenon can then go through several states. Unfortunately, for some states, the ID is recorded in either a product or a consumer field. Also, the number of states is not fixed, and neither is the time between the states.

To beginn with, I need to create an SQL statement that for each phen-ID shows the start time and the time of the last recorded event (could be an end state or one of the intermediate states).

Just considering a single phen-ID, I managed to pull together the following SQL:

WITH myconstants (var1) as (
   values ('000123')
)

select min(l.record_time), max(l.record_time) from 
   (select distinct *  from public.phen_table JOIN myconstants ON var1 IN (phen_id, producer_id, consumer_id)
 ) as l

As the start-state always has the lowest recorded-time for the specific phenomenon, the above statement correctly returns the recorded time range as one row irrespective of what the end state is.

Obviously here I have to supply the phen-ID manually.

How can I make this work that so I get a row of the start times and maxium recorded time for each unique phen-ID? Played around with trying to fit in something like select distinct phen-id ... but was not able to "feed" them automatically into the above. Or am I completely off the mark here?

Addition: Just to clarify, the ideal output using the table above would like something like this:

ID         min-time      max-time
000123     10198789      10577876   (min-time is start, max-time is state iuii)
000124     10477886      10876555   (min-time is start, max-time is end state)
GMB
  • 216,147
  • 25
  • 84
  • 135
Peter K.
  • 517
  • 3
  • 19

2 Answers2

1

I think you're on the right track. Try this and see if it is what you are looking for:

select
    min(l.record_time)
    ,max(l.record_time)
    ,coalesce(phen_id, producer_id, consumer_id) as [Phen ID]
from public.phen_table
group by coalesce(phen_id, producer_id, consumer_id)
DataGuy
  • 198
  • 1
  • 7
  • Am afraid this doesn't work as the ID is not always in the phen_id field. As mentioned above it, could be also in "producer_id" or "consumer_id", in which case phen_id would be empty. – Peter K. Nov 18 '20 at 22:25
  • When you have data in both producer_id and consumer_id fields, how do you choose which one to use? – DataGuy Nov 18 '20 at 22:29
  • An external process determines whether in the new recorded state the observed phenomenon is a producer or a consumer and puts the ID in that respective column. The same phenomenon cannot be both in any state. Unfortunately I cannot change that. There are millions of such phenomena in the DB, and is continuously growing. – Peter K. Nov 18 '20 at 22:37
  • Ok, just to be clear the three id fields will always only have one record per row? – DataGuy Nov 18 '20 at 22:39
  • I modified the answer. Can you run and see if that fits your needs? – DataGuy Nov 18 '20 at 22:46
  • An ID will be either in the phen-ID column or in the producer-id or consumer-id columen, but never in more than in one. When an ID is in the producer-id column, another ID would be in the consumer-ID column (and the phen-ID column would be empyt) – Peter K. Nov 18 '20 at 22:53
  • Check out the modifications I made earlier and see if it works for you. The coalesce function gets the first non null value in the list. I am naming it [Phen ID] but you can change it to whatever you would like. – DataGuy Nov 18 '20 at 22:55
1

union all might be an option:

select phen_id, 
    min(record_time) as min_record_time, 
    max(record_time) as max_record_time
from (
    select phen_id, record_time from phen_table
    union all select producer_id, record_time from phen_table
    union all select consumer_id, record_time from phen_table
) t
where phen_id is not null
group by phen_id

On the other hand, if you want prioritization, then you can use coalesce():

select coalesce(phen_id, producer_id, consumer_id) as phen_id, 
    min(record_time) as min_record_time, 
    max(record_time) as max_record_time
from phen_table
group by coalesce(phen_id, producer_id, consumer_id)

The logic of the two queries is not exactly the same. If there are rows where more than one of the three columns is not null, and values differ, then the first query takes in account all non-null values, while the second considers only the "first" non-null value.


Edit

In Postgres, which you finally tagged, the union all solution can be phrased more efficiently with a lateral join:

select x.phen_id, 
    min(p.record_time) as min_record_time, 
    max(p.record_time) as max_record_time
from phen_table p
cross join lateral (values (phen_id), (producer_id), (consumer_id)) as x(phen_id)
where x.phen_id is not null
group by x.phen_id
GMB
  • 216,147
  • 25
  • 84
  • 135