
I am building a model to allow reporting on two separate datasets; for this example we'll say a Students dataset and a Staff dataset.

The datasets are largely independent and the only real link between the two is Date, so from a model perspective there is a Students star schema and a Staff star schema.

The data displayed is snapshot-type data, answering questions like:

- For a selected date, show all active employees
- For a selected date, show all enrolled students

This means that when a single date is selected, the model finds all employees where the selected date falls within the employment start and end dates, and all students where the selected date falls within the enrolment start and end dates.

This meant I had to decide how to return the correct data from each schema with a single date dimension. Creating a relationship would not work, as relationships in Tabular don't allow "between"-type queries, so instead I have one unrelated Date dimension and the DAX for each schema finds the applicable rows.

The problem is that it's not the most performant: for perhaps 50k rows, adding a measure can take 5-10 seconds.

I'm asking if there is a better way to either write the queries or alter the model, so that I can still do "between"-style queries but with better performance.

Below is an example of a DAX measure that counts all students who were enrolled on a particular date.

Thanks for any advice.

All Enrolled Students:=IF (
    HASONEVALUE ( 'Date'[Date] ),
    CALCULATE (
        DISTINCTCOUNT ( 'Students'[StudentID] ),
        FILTER (
            'Students',
            'Students'[StudentStartDateID] <= MIN ( 'Date'[DateID] )
                && 'Students'[StudentEndDateID] >= MAX ( 'Date'[DateID] )
        )
    ),
    BLANK ()
)
JD_Sudz
  • In order to see if it is possible, you have to add a [MCVE](http://stackoverflow.com/help/mcve); tables with sample data could be useful. – alejandro zuleta Sep 09 '16 at 15:51

2 Answers


Unrelated or "disconnected" tables are good for powering slicers, timelines, and filters in certain situations. As you noted in your question, you have two optimization options: re-structure your dataset or optimize the existing measure syntax.

Re-Structure Dataset

Duplicate each row for every day between the start and end dates, with a column for that iterated date. This can be done a handful of ways depending on how you get your dataset, but could be tedious. Then relate your tables on this iterated date and use the relationship to filter from DATE to FACT. If this is a recurring report and/or you are using SQL to pull the data, this may be worth it to take advantage of PowerPivot's relational calculation power.
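If your version supports calculated tables (SSAS 2016+ or Power BI; Excel PowerPivot does not), a rough sketch of the same expansion done in DAX itself could look like this ('StudentDays' is a hypothetical table name; the column names are taken from your measure):

StudentDays =
GENERATE (
    'Students',
    -- for each student, emit one row per DateID inside the enrolment window
    FILTER (
        ALLNOBLANKROW ( 'Date'[DateID] ),
        'Date'[DateID] >= 'Students'[StudentStartDateID]
            && 'Date'[DateID] <= 'Students'[StudentEndDateID]
    )
)

Relate the resulting DateID column back to 'Date'[DateID], keep only the columns you need, and the "between" logic moves out of the measures entirely.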

Optimize DAX statement

If this is a one-off request or the dataset would be too tedious to duplicate out by day, then stick with the disconnected table approach and clean up the measure syntax. Since you have already included the MIN() and MAX() functions and your CALCULATE() is returning DISTINCTCOUNT(), the conditional HASONEVALUE() function is unnecessary. I ran this in a simulated environment and had good results, but that can vary with computer performance and dataset size. See below for cleaned syntax.

All Enrolled Students:=CALCULATE (
    DISTINCTCOUNT ( 'Students'[StudentID] ),
    FILTER (
        'Students',
        'Students'[StudentStartDateID] <= MIN ( 'Date'[DateID] )
            && 'Students'[StudentEndDateID] >= MAX ( 'Date'[DateID] )
    )
)

If your StudentID column is unique, which would make sense to me, you can further speed this up.

All Enrolled Students:=CALCULATE (
    COUNT ( 'Students'[StudentID] ),
    FILTER (
        'Students',
        'Students'[StudentStartDateID] <= MIN ( 'Date'[DateID] )
            && 'Students'[StudentEndDateID] >= MAX ( 'Date'[DateID] )
    )
)

If StudentID is not a numeric column, replace COUNT() with COUNTA() to get the same effect.

BI_Badger
  • Creating a relationship won't work, at least with this design. E.g. a user wants to see all employees that are in employment on a particular day. The user selects the day 21/01/2016. If a relationship is present, the only rows returned are the rows in the fact table that have a date of exactly 21/01/2016. But employee records are not stored like that; they have a start and end date. So what I actually want is to get all rows in the fact table where the selected date falls between the start and end dates for that row. I'll have a look at the DatesBetween function to see if it can help. – JD_Sudz Sep 09 '16 at 15:21
  • @JD_Sudz I think I understand the structure you're working with better now. Thanks for clarifying. I have updated my answer to reflect this. Please let me know if this helped. – BI_Badger Sep 09 '16 at 17:22
  • @JD_Sudz Does that make sense..? – BI_Badger Sep 14 '16 at 18:38

This type of scenario is often called "Events in progress" or "Events with a duration". Take a look at the links below. The answer will depend on your version of SSAS and the event duration length.

https://www.sqlbi.com/articles/analyzing-events-with-a-duration-in-dax/
https://www.sqlbi.com/articles/understanding-dax-query-plans/
https://blog.gbrueckl.at/2014/12/events-in-progress-for-time-periods-in-dax/
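As a rough illustration of the pattern those articles discuss, applied to your column names and assuming a DAX version that supports variables (SSAS 2016+ / Power BI): filtering the two date columns directly, rather than iterating the whole 'Students' table with FILTER(), generally produces a cheaper query plan.

All Enrolled Students :=
VAR MinDateID = MIN ( 'Date'[DateID] )
VAR MaxDateID = MAX ( 'Date'[DateID] )
RETURN
    CALCULATE (
        DISTINCTCOUNT ( 'Students'[StudentID] ),
        -- each predicate touches a single column, so the engine can
        -- apply it as a column filter instead of iterating every row
        'Students'[StudentStartDateID] <= MinDateID,
        'Students'[StudentEndDateID] >= MaxDateID
    )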

If these measures don't perform well (which can happen with events that have a long duration), it may be necessary to generate a table containing a row for each day of the event. The SQL would look something like this:

SELECT
     d.CalendarDate
    ,s.StudentID
FROM dbo.Students AS s
CROSS JOIN dbo.DimDate AS d
-- keep only the days that fall inside each student's enrolment window,
-- comparing the date surrogate keys on both sides
WHERE d.DateID >= s.StudentStartDateID
  AND d.DateID <= s.StudentEndDateID

Create a relationship from this table to the date/calendar table.

With this design you can use a simple DISTINCTCOUNT(Students[StudentID]) measure, which should perform better. The trade-off is that this table can become quite large. Keep it as narrow as possible for best performance and memory conservation. Another optimization could be to use a different granularity such as week or month instead of day.
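For instance, if the cross-joined table above is loaded as a hypothetical 'StudentDays' table and related to the calendar table on its date key, the measure reduces to:

All Enrolled Students :=
-- the relationship from the calendar table does the "between" filtering
DISTINCTCOUNT ( 'StudentDays'[StudentID] )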

Chris Koester
  • +1 for the model suggestion. This is what I suggest as well. Within a Tabular context, I've typically seen the performance and size become unreasonable for anything with more than single-digit millions of Students across more than a couple of years. There are very few scenarios I've come across where a client has too much data for this to make sense at a day grain. And there are *no* scenarios I've come across where a client has too much data for this to make sense at a month grain. If your dimensions are that large, you're typically beyond what can be reasonably handled in Tabular, anyway. – greggyb Aug 11 '19 at 14:21