
I need help with structuring overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering in reports.

I have two dimensions, DimEmployee and DimDate, and a fact table called FactAttendance. The records in this fact are stored as follows:

[FactAttendance: sample fact rows were shown as an image in the original post]

To represent this graphically:

[Leave Management System: a timeline chart of each employee's leave ranges, with the combined coverage highlighted as light-blue segments at the bottom, included as an image in the original post]

A report needs to be created from this data that allows the end user to filter it by selecting a date range. Let's assume the user selects the range D1 to D20. For that selection, the user should see how many days at least one of the employees was on leave. In this particular example, that is the total length of the light-blue segments at the bottom, i.e. 11 days.
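For reference, this is the number I am after. It could be computed directly against the range-based fact with a query along these lines (a sketch only; I am assuming the fact has LeaveStartDate and LeaveEndDate columns, DimDate has one row per calendar date with a FullDate column, and @RangeStart/@RangeEnd stand for the user's selection):

    -- Count the calendar dates in the selected range that are covered
    -- by at least one leave; DISTINCT merges overlapping leaves
    SELECT COUNT(DISTINCT d.FullDate) AS DaysWithLeave
    FROM DimDate d
    JOIN FactAttendance f
        ON d.FullDate BETWEEN f.LeaveStartDate AND f.LeaveEndDate
    WHERE d.FullDate BETWEEN @RangeStart AND @RangeEnd;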

An approach that I am considering is to store one row per employee per date for each leave. The only problem with this approach is that it greatly increases the number of records in the fact table, since every leave is multiplied by its length in days. Besides, there are other columns in the fact that would end up holding redundant data.
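Concretely, this approach would mean expanding every leave range into one row per covered date, for example (continuing the illustrative names from the sketch above; FactAttendanceDaily is a hypothetical target table):

    -- Expand each leave range into one row per employee per date
    INSERT INTO FactAttendanceDaily (DateKey, EmployeeKey)
    SELECT d.DateKey, f.EmployeeKey
    FROM FactAttendance f
    JOIN DimDate d
        ON d.FullDate BETWEEN f.LeaveStartDate AND f.LeaveEndDate;

A 10-day leave therefore becomes 10 rows, and every other column on the fact is copied 10 times.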

How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?

Akshay Rane

1 Answer


Consider modelling your fact like this:

fact_attendance (date_id, employee_id, hours, ...)

This will enable you to answer your original question by simply filtering on the Date dimension, and you will also be able to handle issues like leave credits and fractional-day leave usage.
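With that grain, the requirement in the question becomes a simple filter and distinct count, for example (a sketch only; the dim_date names and the @RangeStart/@RangeEnd parameters are assumed):

    -- One fact row per employee per leave day; DISTINCT handles days
    -- on which more than one employee was on leave
    SELECT COUNT(DISTINCT f.date_id) AS days_with_leave
    FROM fact_attendance f
    JOIN dim_date d ON d.date_id = f.date_id
    WHERE d.full_date BETWEEN @RangeStart AND @RangeEnd;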

Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.

If you are really worried about storage (probably not a real worry), use a DBMS with columnar compression and you'll see large savings in disk usage.

The reason I say "not a real worry" is that the savings are meaningless with today's storage. 1,000 employees with 20 days of leave each per year, over five years, would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.

Ron Dunn
  • Yes, this is an approach I am considering. However, there are other fields in the fact table, mainly leave-related metrics, which would repeat for each date if I increase the granularity to date level. This creates the overhead of handling the repeated values in the measures. – Akshay Rane Mar 03 '16 at 18:20
  • Can you give an example of a leave-related metric that would need to be repeated? It will help to identify candidate solutions. – Ron Dunn Mar 03 '16 at 21:00
  • Well, some of the metrics that will have to be repeated are *IsPreApproved* and *IsLeaveWithoutPay*, while a few others do fit a table with date granularity, such as *IsCasualLeave*, *IsSickLeave*, etc. – Akshay Rane Mar 08 '16 at 08:40
  • 1
    That doesn't look normal. LeaveWithoutPay, CasualLeave, SickLeave, etc. should probably be business keys to a LeaveType dimension. IsPreApproved is probably a degenerate dimension. – Ron Dunn Mar 08 '16 at 09:54
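A minimal sketch of the structure suggested in the last comment, with all names and types assumed for illustration only:

    -- Leave types become a dimension rather than flags on the fact
    CREATE TABLE dim_leave_type (
        leave_type_id   INT PRIMARY KEY,
        leave_type_code VARCHAR(20),    -- business key, e.g. 'CASUAL', 'SICK', 'LWOP'
        leave_type_name VARCHAR(50)
    );

    -- Daily-grain fact; IsPreApproved stays on the fact as a degenerate dimension
    CREATE TABLE fact_attendance (
        date_id         INT NOT NULL,   -- FK to dim_date
        employee_id     INT NOT NULL,   -- FK to dim_employee
        leave_type_id   INT NOT NULL,   -- FK to dim_leave_type
        is_pre_approved BIT NOT NULL,   -- degenerate dimension
        hours           DECIMAL(4,2)    -- supports fractional-day leave
    );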