0

I have a workbook that imports data via Power Query from 2 large files (QueryA = 11,000 rows and QueryB = 130,000 rows).

When I join the QueryA[ID] with QueryB[ID] I get back around 11,000 rows relatively quickly. Unfortunately I found duplicate values in QueryB for [ID] which creates duplicate rows in QueryA so I need to filter QueryB based on the max [Date] such that only 1 row of data (the row for the Max Date for each given ID) in QueryB when the join to QueryA occurs.

I inserted the following statement in QueryB.

FilterOutDups = Table.Select(PriorStep, each ([Date] = List.Max(let CurID = [ID] in Table.SelectRows(PriorStep, each [ID] = CurID)[Date])))

This step gives me back the results as expected (1 row/ID with the values related to the Max Date for that ID).

However, when I join the result of QueryB (with the above statement included) to QueryA the result never finishes in an hour and the connections window is showing 125GB and counting when I kill Excel to stop it.

I'd greatly appreciate anyone who can identify why this one statement adds soooo much time to QueryA and how I can alter the code to get the results needed.

Thanks.

Ron Rosenfeld
  • 53,870
  • 7
  • 28
  • 60
  • Not enough info to recreate your problem. The slowdown is likely involved in Table.SelectRows It may be faster to group by ID and aggregate (return) the row that contains the maximum date. – Ron Rosenfeld Oct 12 '21 at 10:42

2 Answers2

0

The way you have it written, your filter condition has to be evaluated individually for each row of your table. As you've noticed, this is inefficient performs poorly when the number of rows gets very big.

The solution is, as @Ron suggests, is to group QueryB by [ID], taking the max over the [Date] column before joining it with QueryA. Query engines are much better at this sort of operation than dealing with rows individually.

Alexis Olson
  • 38,724
  • 7
  • 42
  • 64
  • Thanks. So if I group I know how to get the Max Date but how do I get the 4-5 other fields I need that are associated with the Max Date row for the given ID? – user11632362 Oct 12 '21 at 16:26
  • Are they different for different dates or can you just use max on them as well? – Alexis Olson Oct 12 '21 at 16:29
  • Different values in different columns associated with the Max Date row that I need to pull so I can't use Max on those other columns as part of the Group By. – user11632362 Oct 13 '21 at 09:17
  • Then you need an extra step. Do the group by on [ID] taking max over the [Date] and then merge this result with the original table (matching in [ID] and [Date]) and expand the other fields you need. See my answer [here](https://stackoverflow.com/questions/51437962) for a bit more detail. – Alexis Olson Oct 13 '21 at 14:50
  • 1
    Thanks again. I found a solution here. https://www.ehansalytics.com/blog/2020/7/16/return-row-based-on-max-value-from-one-column-when-grouping It runs lightning fast and only requires 1 step. – user11632362 Oct 13 '21 at 18:24
  • Ooh. I like that. It might not be optimal if you have a bunch of columns but for just a few, it's pretty slick. – Alexis Olson Oct 13 '21 at 18:46
0

I found this elegant and very fast solution via https://www.ehansalytics.com/blog/2020/7/16/return-row-based-on-max-value-from-one-column-when-grouping

My 11,000 rows in QueryA and 130,000 rows in QueryB returns in about 10 seconds. Here is the code.

let Source = whatever GroupRows = Table.Group( Source, {"ID"}, { {"Max Date", each List.Max([Date])}, {"Other Field1", each Table.Max(, "Date")[Other Field1]}, {"Other Field2", each Table.Max(, "Date")[Other Field2]} } ) in GroupRows