
I have a problem with PowerPivot. Let's have a look at only 3 columns in my data source:

date - clientid - category

Category can only be 1 or 2. In the data source you can often find the same clientid multiple times within a given time period, sometimes with different categories.

So in my pivot table, I can see the distinct count of my clients for the chosen timeline. But, of course, the sum of clients for cat=1 and cat=2 is bigger than the distinct count. Is it possible to count only the newest entry for every clientid, so that the sum of the two cats equals the distinct count of my clients?

Thanks in advance to everybody who helps and spends their time on this.

Stefan


This was fun! Thanks for an interesting problem. Normally for this sort of thing we might flag the most recent entry for a given clientid in an extra field, but yours needs to be dynamic at runtime based on your date filter selection.

Here we go. Be warned, it's a doozy.

CountCat:=
COUNTROWS(
    FILTER(
        GENERATE(
            VALUES( ClientCats[clientid] ) // distinct clients in context
            ,CALCULATETABLE(
                SAMPLE( // one row: the latest date/category pair
                    1
                    ,SUMMARIZE(
                        ClientCats
                        ,ClientCats[date]
                        ,ClientCats[category]
                    )
                    ,ClientCats[date]
                    ,DESC
                )
                ,ALL( ClientCats[category] ) // ignore any category filter
            )
        )
        ,CONTAINS( // keep clients whose latest category is in context
            VALUES( ClientCats[category] )
            ,ClientCats[category]
            ,ClientCats[category]
        )
    )
)

Let's work through it.

COUNTROWS() is trivial.

FILTER() takes a table as its first argument. It creates a row context by iterating row-by-row through this table. It evaluates a boolean expression in each row context and returns the rows for which the expression returns true. We're not getting to that expression for a little while here. Let's look at the table we'll be filtering.
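FILTER()'s iterate-and-keep behavior can be sketched in Python. This is a simulation of the DAX semantics, not DAX itself; the rows are made up, and plain dicts stand in for DAX tables:

```python
# Hypothetical rows standing in for a DAX table.
rows = [
    {"clientid": "A", "category": 1},
    {"clientid": "B", "category": 2},
    {"clientid": "C", "category": 1},
]

def dax_filter(table, predicate):
    # FILTER(): evaluate the boolean expression in each row context
    # and keep the rows for which it returns true.
    return [row for row in table if predicate(row)]

kept = dax_filter(rows, lambda r: r["category"] == 1)
```
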

GENERATE() takes a table as its input and creates a row context by iterating row-by-row through that table. For each row, it evaluates a second table expression in that row context and cross joins the resulting rows with the current row of the first table.
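A minimal Python sketch of that cross-join behavior, using hypothetical data (the inner lookup stands in for the CALCULATETABLE() expression, which DAX would evaluate in the current row context):

```python
# First table: distinct clients, as VALUES() would return them.
clients = [{"clientid": "A"}, {"clientid": "B"}]

def inner_table(client_row):
    # Stand-in for the second table expression; real DAX evaluates
    # CALCULATETABLE(...) in the row context of `client_row`.
    lookup = {"A": [{"date": "2016-01-02", "category": 1}],
              "B": [{"date": "2016-01-03", "category": 2}]}
    return lookup[client_row["clientid"]]

def generate(table, inner):
    # GENERATE(): cross join each outer row with its inner rows.
    return [{**outer, **row} for outer in table for row in inner(outer)]

result = generate(clients, inner_table)
```
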

Our first table is VALUES( ClientCats[clientid] ), which is simply a distinct list of all [clientid]s in context from the pivot table.

We then evaluate CALCULATETABLE() for each row context, aka for each [clientid]. CALCULATETABLE() evaluates a table expression in the filter context determined by its second and subsequent arguments.

SAMPLE() is the table we'll evaluate. SAMPLE() is like TOPN(), but with ties broken non-deterministically. SAMPLE( 1, ... ) always returns one row. TOPN( 1, ... ) returns all rows that are tied for first position.

SAMPLE(), here, will return one row from the table defined by SUMMARIZE(). SUMMARIZE() groups a table by the named fields, so we get a table of all distinct [date]/[category] combinations that exist in the context determined by our CALCULATETABLE(). SAMPLE()'s third argument is the sort-by column that determines which rows come first, and its fourth is the sort order. Thus for each [clientid] we return the latest row of that SUMMARIZE().
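The difference between the two functions on ties can be sketched in Python with made-up rows (DAX breaks SAMPLE() ties non-deterministically; this sketch just truncates the sorted list, which is enough to show the row counts):

```python
rows = [
    {"date": "2016-01-05", "category": 1},
    {"date": "2016-01-05", "category": 2},  # tied on the sort column
    {"date": "2016-01-01", "category": 1},
]

def topn(n, table, key):
    # TOPN(): keeps every row tied with the nth row.
    ordered = sorted(table, key=key, reverse=True)
    cutoff = key(ordered[n - 1])
    return [r for r in ordered if key(r) >= cutoff]

def sample(n, table, key):
    # SAMPLE(): always returns exactly n rows.
    return sorted(table, key=key, reverse=True)[:n]

top_rows = topn(1, rows, lambda r: r["date"])
sample_rows = sample(1, rows, lambda r: r["date"])
```
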

The ALL() in our CALCULATETABLE() strips the context from the field [category] that might be coming in from our pivot table. This means that every time we evaluate our GENERATE() (remember we're still in that function here), we get a table of all [clientid]s that exist in context, and their most recent [category], even when we're evaluating in a pivot cell that has filtered [category].

That sounds like a problem - we'd expect the same count now for every pivot cell. And that's what we'd get if we did COUNTROWS( GENERATE() ). But wait, we're still in FILTER()!

Now we get to the boolean expression which will filter the rows of that GENERATE(). CONTAINS() takes a table as its first argument, a reference to a column in that table as its second argument, and a scalar value as its third argument. It returns true if the column in argument 2, of the table in argument 1, contains the value in argument 3.
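CONTAINS()'s basic semantics are easy to sketch in Python (hypothetical values; a list of dicts stands in for the VALUES() table):

```python
def contains(table, column, value):
    # CONTAINS(): true when any row of `table` has `value` in `column`.
    return any(row[column] == value for row in table)

# In a pivot cell filtered to category 1, VALUES(ClientCats[category])
# is a one-row table:
values_in_context = [{"category": 1}]
in_context = contains(values_in_context, "category", 1)
not_in_context = contains(values_in_context, "category", 2)
```
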

We are outside of the CALCULATETABLE(), and therefore context exists on [category]. VALUES() returns the unique rows in context. In any pivot cell filtered by [category], this will be a 1x1 table, but in our grand total, it will have multiple rows.

So, the column in that VALUES() we want to test is [category] (the only column that exists in that VALUES()). The value we want to test for is referred to by ClientCats[category]. That third argument evaluates [category] in the row context determined by FILTER(). Thus we return true for every row that matches the current filter context (in a pivot cell) of ClientCats[category]. Mind-bending stuff here.

Anyway, the upshot is that in a [category]-filtered pivot cell, we get the number of distinct [clientid]s that have, for the time frame selected, that [category] value as their most recent category.

For the grand total we get every [clientid] in context.
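Putting it all together, the measure's net effect can be simulated in Python on made-up data: for each clientid in context, take its most recent category (ignoring any category filter, as the ALL() does), then count only the clients whose latest category is in the pivot cell's filter context:

```python
data = [
    {"date": "2016-01-01", "clientid": "A", "category": 1},
    {"date": "2016-01-05", "clientid": "A", "category": 2},  # A's latest: 2
    {"date": "2016-01-03", "clientid": "B", "category": 1},  # B's latest: 1
]

def count_cat(rows, categories_in_context):
    # Latest category per client (ISO dates sort lexicographically).
    latest = {}
    for r in sorted(rows, key=lambda r: r["date"]):
        latest[r["clientid"]] = r["category"]  # last write wins
    # Keep only clients whose latest category matches the filter context.
    return sum(1 for cat in latest.values() if cat in categories_in_context)

per_cat = count_cat(data, {1}) + count_cat(data, {2})
grand_total = count_cat(data, {1, 2})
```

Note that the two category counts now sum to the grand total, which was the whole point of the question.
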

This will probably not have a very good performance curve.

Here's a sample workbook with the working measure defined, so you can play with it.

Edit

Based on replies below.

Do you need to keep all the rows with [UseClient] <> 1 in the model? Deduping and flagging are always easier in tools other than Power Pivot.

I have no idea how you've determined the values of 1 in [UseClient]; none of them marks the most recent entry for its [ClientID]. If you just want to flag the most recent row (which is what it sounds like you want, but not what your workbook shows), you can do that much more easily with a calculated column than with a measure:

=SAMPLE(
    1
    ,CALCULATETABLE( // return all dates for the [clientid] on current row
        VALUES( ClientCats[date] )
        ,ALLEXCEPT( ClientCats, ClientCats[clientid] )
    )
    ,ClientCats[date]
    ,DESC
) = ClientCats[date] // row context in table

This will return true when the value of [date] on a given row is equal to the maximum [date] for the client on that row.
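That flagging logic, sketched in Python on hypothetical rows (two passes: find each client's max date, then flag the rows that match it):

```python
rows = [
    {"clientid": "A", "date": "2016-01-01"},
    {"clientid": "A", "date": "2016-01-05"},
    {"clientid": "B", "date": "2016-01-03"},
]

# Pass 1: maximum date per client (ISO dates compare lexicographically).
max_date = {}
for r in rows:
    max_date[r["clientid"]] = max(max_date.get(r["clientid"], r["date"]),
                                  r["date"])

# Pass 2: flag each row whose date is its client's maximum.
for r in rows:
    r["IsLatest"] = (r["date"] == max_date[r["clientid"]])
```
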

One thing you could easily do in Power Query is to group [clientid] and take the max date for each [clientid]. Then you have one row per client.
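That grouping step, sketched in Python rather than Power Query's M (made-up rows; group by clientid and keep the max date, leaving one entry per client):

```python
rows = [
    {"clientid": "A", "date": "2016-01-01"},
    {"clientid": "A", "date": "2016-01-05"},
    {"clientid": "B", "date": "2016-01-03"},
]

grouped = {}
for r in rows:
    # Keep the latest date seen for each client.
    grouped[r["clientid"]] = max(grouped.get(r["clientid"], r["date"]),
                                 r["date"])
```
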

This is all different from your original question, though, because your original wants to find the maxes based on the date selection. A calculated column is not updated based on filter context; it is only recalculated at model refresh time. If you're willing to use a calculated column, then just deal with your data issues before bringing the data into Power Pivot.

greggyb
  • Wow, this is great stuff! – Stefan Helling Jan 15 '16 at 11:56
  • It worked out of the box after I changed DESC to "0" (Excel told me to). As I have a lot of columns like "category" I am trying to understand the measure, really mind-bending. - I could copy the measure to use it for another column, no problem. - It doesn't work when I have to filter by more than one of these columns in the pivot table. - Is it possible to modify the measure for use in a calculated column, indicating TRUE when the row is used for counting? Then I can include this column in a filter. – Stefan Helling Jan 15 '16 at 12:10
  • Short answer to those questions is "almost definitely yes". I can't tell from such a brief description what you need. If you can share sample data, that's the easiest for me to work with. Otherwise, please illustrate your table structure like in your OP with more description so I can understand what you need better. – greggyb Jan 15 '16 at 13:52
  • Thanks again, Greg. I have about 40 Columns in my data source which are just "updated" (so I need just the newest values). Most of them I do need in the combination with category (which indicates, if the client has a single- or double-parent family). So a "calculated column"-solution would be the best, see [an updated version of your file](https://onedrive.live.com/redir?page=view&resid=A6071FAA3D284A32!430&authkey=!ACfZQewzSegAHq8) @greggyb – Stefan Helling Jan 18 '16 at 10:44
  • Awesome work, Greg - just wanted to reiterate that this sort of logic really should be done upstream in your dataset itself. – Kyle Hale Jan 20 '16 at 03:36
  • Very much agree, Kyle. – greggyb Jan 20 '16 at 13:46