4

I am having a huge issue turning a time-stamped status table into a flat table that can be queried very quickly.

I essentially have a table like this:

╔══════════╦═══════════╦══════════╦══════════╦═══════════╦══════════╗
║ PersonID ║ Firstname ║ Lastname ║ status   ║ startdate ║ endate   ║
║ 10233    ║ stacy     ║ adamns   ║ active   ║ 12-23-13  ║ 02-11-14 ║
║ 10233    ║ stacy     ║ adamns   ║ pending  ║ 02-11-14  ║ 03-09-14 ║
║ 10233    ║ stacy     ║ adamns   ║ inactive ║ 03-09-14  ║ 12-31-99 ║
║ 10244    ║ steve     ║ smith    ║ active   ║ 01-07-14  ║ 12-31-99 ║
╚══════════╩═══════════╩══════════╩══════════╩═══════════╩══════════╝

and turn it into:

╔══════════╦══════════╦═══════════╦══════════╦════════╗
║ Date     ║ PersonID ║ Firstname ║ Lastname ║ status ║
║ 12-23-13 ║ 10233    ║ stacy     ║ adamns   ║ active ║
║ 12-24-13 ║ 10233    ║ stacy     ║ adamns   ║ active ║
║ 12-25-13 ║ 10233    ║ stacy     ║ adamns   ║ active ║
║ 12-26-13 ║ 10233    ║ stacy     ║ adamns   ║ active ║
║ ...      ║ ...      ║ ...       ║ ...      ║ ...    ║
╚══════════╩══════════╩═══════════╩══════════╩════════╝

This table has 28 additional columns describing the employee (they are static and don't change, e.g. Height), and it is 48 million rows long...

I need to know how many employees were in the state of "active" for each day for the past 2 years.
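
Concretely, the query I want to be able to run fast against the flat table is just a per-day head count, something like this (People_history being the flat table I am trying to build):

-- Daily count of people in 'active' status, one row per day
Select Day_id, count(*) as active_count
    from People_history
    where status = 'active'
    group by Day_id;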

Now, with a smaller date range or data set this is pretty easy: I would just join it with a calendar table, with something similar to this:

Create Table People_history as
    Select Day_id, Firstname, Lastname, status
    from People
    Join Time_calendar on Day_id between startdate and endate;

I have calculated that the resulting table would be 7.8 billion rows and over 3 terabytes; however, my database cannot even finish the query because it runs out of temp space. With a cursor I can get around the memory problem, but it takes over 24 hours to run... I only need to do this once, so maybe that's what I'll be stuck doing, but I figured I would ask you guys first.

Should I be looking at a different database to do this sort of analysis or just a more efficient method?

I've looked into Cassandra, where the recommendation is to create columns for the time intervals, and MongoDB, where the intervals and statuses could go into their own hash on each person. Are these good alternatives?

MKunz
  • since you only need to do it once, why don't you just split it into multiple executions? i.e. by month or quarter, by limiting the records in your time_calendar – ah_hau Aug 05 '14 at 03:09
  • @ah_hau: What if the OP has to do it over and over in the near future? Something like this SO question here: http://stackoverflow.com/questions/13037749/crystal-reports-need-to-group-by-derived-date-range – Rachcha Aug 05 '14 at 03:18
  • I would suggest storing it into a table and running the job multiple times, chunking it into multiple timeframes (see the sketch after these comments). You cannot generate 7.8 billion rows on the fly when required. – ah_hau Aug 05 '14 at 03:27
  • @user3586892, perhaps you need to add another filter and see if it helps: Create Table People_history as Select Day_id, Firstname, Lastname, status from People Join Time_calendar on day_id between startdate and endate and startdate > 'your start date'; – ah_hau Aug 05 '14 at 03:29
  • Do you _need_ the table? Your attempts to query the destination table may end up slower just due to sheer _size_. What queries do you plan to run on the data - do you just have that simple aggregate? In which case the original table seems perfectly set up for that query... Oh, your `endDate` should be queried exclusively (`<`), as that's what your db seems to be storing. – Clockwork-Muse Aug 05 '14 at 04:05
  • The reason I was trying to avoid aggregating is because a different question will arise for each one of those additional columns ("Is there growth in membership of left-handed people 6 feet tall with brown hair?"). I was assuming a table scan for that would be much faster than joining it with the calendar table each time. However, as you mentioned, the sheer size of the table might actually make it slower? – MKunz Aug 05 '14 at 14:46
  • Why do you care about those additional columns? You should tell us all the concerns you have. – Jaugar Chang Aug 07 '14 at 09:17
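
A sketch of the chunked approach ah_hau suggests, assuming Oracle PL/SQL, an empty People_history table created up front, and the question's Time_calendar table (the 24 monthly chunks are illustrative):

-- Build the flat table in monthly chunks instead of one giant CTAS,
-- so each insert's sort/temp footprint stays bounded.
BEGIN
  FOR i IN 0 .. 23 LOOP
    INSERT /*+ APPEND */ INTO People_history
      SELECT c.Day_id, p.Firstname, p.Lastname, p.status
        FROM People p
        JOIN Time_calendar c
          ON c.Day_id between p.startdate and p.endate
       WHERE c.Day_id >= add_months(trunc(SYSDATE, 'MM'), -24 + i)
         AND c.Day_id <  add_months(trunc(SYSDATE, 'MM'), -23 + i);
    COMMIT;  -- commit per chunk to release undo/temp before the next insert
  END LOOP;
END;
/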

2 Answers

2

The answers on the Oracle forum here may help.

With the help of these answers I have come up with the following:

WITH date_ranges AS
         (    SELECT DISTINCT personid,
                              firstname,
                              lastname,
                              startdate + LEVEL - 1 AS date_i
                FROM myTable
          -- row generator: CONNECT BY LEVEL expands each interval into one
          -- row per day; DISTINCT removes the duplicates this produces
          CONNECT BY LEVEL <= CEIL (endate - startdate) + 1)
  SELECT dr.date_i,
         dr.personid,
         dr.firstname,
         dr.lastname,
         -- scalar subquery looks up the status in effect on each generated day
         (SELECT mt.status
            FROM myTable mt
           WHERE     mt.personid = dr.personid
                 AND dr.date_i BETWEEN mt.startdate AND mt.endate)
             AS status
    FROM date_ranges dr;

Please make the necessary changes and use accordingly.

Rachcha
  • #winces# Ouch. First off, large rowsets with a subquery in the `SELECT` tend to get RBAR'd (row-by-agonizing-row) when it queries for the other information. Then there's `DISTINCT` and `ORDER BY`... I have a feeling `DISTINCT` can be removed by more careful application of date math. `ORDER BY` should be completely unnecessary - this is being dumped to a table, not displayed as a result set; at the given number of rows, you want to disable creation of indices until _after_ the rows are added, and probably re-org the table to boot. – Clockwork-Muse Aug 05 '14 at 04:11
  • @Clockwork-Muse: Agreed. The `ORDER BY` is of no use here, but as this query is going to execute just once, further tuning does not seem very necessary. IMHO a `DISTINCT` can't do much harm here. Removing the `ORDER BY`. – Rachcha Aug 05 '14 at 04:26
  • You _are_ aware `DISTINCT` needs to hash/sort the rows in order to remove duplicates, right? Which you're doing over the **ENTIRE** set of rows - the multi-billion count, multi-terabyte stored set? How did you think it worked? I'd actually probably expect this to DoS the server _faster_ than a simple join with a calendar table... – Clockwork-Muse Aug 05 '14 at 04:45
  • Yes, that too can be a great approach. Right now I've not been able to find a way to remove this `DISTINCT`. Perhaps the OP could work on this part (a sketch of the join-based alternative follows below). – Rachcha Aug 05 '14 at 04:51
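
For reference, the DISTINCT-free alternative being discussed is the plain calendar join (a sketch using this answer's myTable and the question's Time_calendar; the end date is treated as exclusive, as the sample data suggests):

-- No row generator and no DISTINCT: each interval row joins directly to
-- the calendar days it covers, so there is no dedup pass over the result.
Create Table People_history as
    Select c.Day_id as date_i, mt.personid, mt.firstname, mt.lastname, mt.status
    from myTable mt
    Join Time_calendar c
      on c.Day_id >= mt.startdate
     and c.Day_id <  mt.endate;
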
1

I need to know how many employees were in the state of "active" for each day for the past 2 years.

To reach the goal, you don't need to create a 7.8 billion row table. Just use the original table. I use an approach that can calculate avg or sum by date or month using only one full table scan. Your requirement is very simple.

Assume from_date is add_months(date'2014-08-05', -24) and to_date is date'2014-08-05'.
Try this:

select t1.*
from t1
where ( (startdate <= date'2014-08-05' and enddate > date'2014-08-05')
      or (startdate <= add_months(date'2014-08-05', -24) and enddate > add_months(date'2014-08-05', -24))
      or (startdate >= add_months(date'2014-08-05', -24) and enddate < date'2014-08-05' ) )

Then you can get all user statuses within the 2 years. This statement needs only a full table scan; a 48 million row table should be done in several minutes.

Add a status filter condition and a distinct on the personid, and you can get the result you want.

select distinct t1.personid,...
from t1
where ( (startdate <= date'2014-08-05' and enddate > date'2014-08-05')
      or (startdate <= add_months(date'2014-08-05', -24) and enddate > add_months(date'2014-08-05', -24))
      or (startdate >= add_months(date'2014-08-05', -24) and enddate < date'2014-08-05' ) )
     and status = 'active'

UPDATE:
According to the OP's requirement, how many employees were in the state of "active" for each day for the past 2 years, my previous solution misses the "each day" part. To find out whether the state held for the whole two years, we should calculate the duration of each state.

Calculate the duration of status:

with temp as (select t1.*
from t1
where ( (startdate <= date'2014-08-05' and enddate > date'2014-08-05')
      or (startdate <= add_months(date'2014-08-05', -24) and enddate > add_months(date'2014-08-05', -24))
      or (startdate >= add_months(date'2014-08-05', -24) and enddate < date'2014-08-05' ) )
  and status = 'active')
select temp.personid, status,
-- clip each interval to the two-year window, then add up the clipped lengths
sum(case when enddate < date'2014-08-05' 
      then enddate 
      else date'2014-08-05' 
    end
  - case when startdate > add_months(date'2014-08-05', -24) 
      then startdate 
      else add_months(date'2014-08-05', -24) 
    end) as duration
from temp
group by temp.personid, status

Then filter, with a HAVING clause appended to the query above, for the durations that equal the full 2-year span, and you reach the goal:

having 
sum(case when enddate < date'2014-08-05' 
      then enddate 
      else date'2014-08-05' 
    end
  - case when startdate > add_months(date'2014-08-05', -24) 
      then startdate 
      else add_months(date'2014-08-05', -24) 
    end) = date'2014-08-05' - add_months(date'2014-08-05', -24) 

As far as I know, this is the most efficient way. I hope it helps.

Be careful with those date-comparison conditions. I built a SQL Fiddle to help you test.
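
And if a literal per-day head count is needed, the same no-materialization idea works by joining to a calendar table and aggregating immediately (a sketch reusing the question's Time_calendar and this answer's t1 naming; the end date is treated as exclusive):

-- Count actives per day straight off the 48M-row interval table:
-- the join output feeds the GROUP BY, no 7.8-billion-row table needed.
select c.Day_id, count(*) as active_count
from Time_calendar c
join t1
  on c.Day_id >= t1.startdate
 and c.Day_id <  t1.enddate
where t1.status = 'active'
  and c.Day_id between add_months(date'2014-08-05', -24) and date'2014-08-05'
group by c.Day_id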

Jaugar Chang