
I have a very large table, millions of rows, with many columns, among them Service, Start, and End. Service is a name; Start and End indicate outage times.

The table looks something like:

 Service | Start     | End
 LAN     | 1/1 12:00 | 3/1 12:00
 LAN     | 2/1 14:00 | 3/1 14:00
 WAN     | 5/1 10:00 | 7/1 08:00
 WAN     | 6/1 08:00 | 7/1 10:00

The aim is to create an aggregate table listing Service, date and outage length, something like:

 Service | Date      | Outage length
 LAN     | 1/1       | 12 h
 LAN     | 2/1       | 24 h
 LAN     | 3/1       | 14 h
 WAN     | 5/1       | 14 h
 WAN     | 6/1       | 24 h
 WAN     | 7/1       | 10 h

My workflow is as follows:

Step 1. To consolidate overlapping outage times I have used this Stack Overflow solution. It works fine. Based on my input table, the output looks like the table below (a sketch of one way to do the merge follows it):

 Service | Start     | End
 LAN     | 1/1 12:00 | 3/1 14:00
 WAN     | 5/1 10:00 | 7/1 10:00
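
A minimal, hedged sketch of this merge step (not necessarily the linked answer's method): group by Service, sort each group by Start, then fold over the rows, extending the last merged interval whenever the next row overlaps it. The year 2023 and the d/m reading of the dates are assumptions for illustration.

let
    // toy copy of the sample data; the year and exact dates are assumed
    Source = Table.FromRecords({
        [Service = "LAN", Start = #datetime(2023, 1, 1, 12, 0, 0), End = #datetime(2023, 1, 3, 12, 0, 0)],
        [Service = "LAN", Start = #datetime(2023, 1, 2, 14, 0, 0), End = #datetime(2023, 1, 3, 14, 0, 0)],
        [Service = "WAN", Start = #datetime(2023, 1, 5, 10, 0, 0), End = #datetime(2023, 1, 7, 8, 0, 0)],
        [Service = "WAN", Start = #datetime(2023, 1, 6, 8, 0, 0), End = #datetime(2023, 1, 7, 10, 0, 0)]
    }),
    // merge overlapping intervals inside one already-grouped table
    MergeGroup = (t as table) as table =>
        let
            sorted = Table.Sort(t, {{"Start", Order.Ascending}}),
            merged = List.Accumulate(
                Table.ToRecords(sorted),
                {},
                (acc, r) =>
                    if List.IsEmpty(acc) then {r}
                    else if r[Start] <= List.Last(acc)[End]
                    then List.RemoveLastN(acc, 1) & {Record.TransformFields(List.Last(acc), {"End", each List.Max({_, r[End]})})}
                    else acc & {r})
        in
            Table.FromRecords(merged),
    Grouped = Table.Group(Source, {"Service"}, {{"data", each MergeGroup(_), type table}}),
    Expanded = Table.ExpandTableColumn(Grouped, "data", {"Start", "End"})
in
    Expanded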

Step 2. I've added a Date column and a calculated Outage column to get the final result (sketched below).
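
As a hedged sketch of how Step 2 could split one merged interval into per-day hours (SplitByDay is a made-up helper name and the 2023 dates are assumed; this is not necessarily how the Date and Outage columns above are built):

let
    // expand one merged outage into one row per calendar day, with the hours of outage on that day
    SplitByDay = (start as datetime, end as datetime) as table =>
        let
            days = List.Dates(Date.From(start), Duration.Days(Date.From(end) - Date.From(start)) + 1, #duration(1, 0, 0, 0)),
            rows = List.Transform(days, (d) =>
                let
                    dayStart = DateTime.From(d),
                    dayEnd = dayStart + #duration(1, 0, 0, 0),
                    s = if start > dayStart then start else dayStart,
                    e = if end < dayEnd then end else dayEnd
                in [Date = d, #"Outage length" = Duration.TotalHours(e - s)])
        in
            Table.FromRecords(rows)
in
    // LAN's merged outage from the example, 1/1 12:00 to 3/1 14:00, gives 12 h, 24 h and 14 h
    SplitByDay(#datetime(2023, 1, 1, 12, 0, 0), #datetime(2023, 1, 3, 14, 0, 0))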

It works, big thanks to @horseyride for the original solution, but it is very slow and takes forever to run. I realise that there are a lot of records, but I'm willing to bet there is a better workflow or algorithm that could speed up the process dramatically.


1 Answer


You may be better off with the OVERLAP function in Power BI.

If it helps, I think this works considerably (?) faster than the original Power Query answer by (a) removing half of the initial date-comparison checks and (b) grouping and only operating on one group at a time. Add or remove the Table.Buffer calls as needed; I'm not sure they are helping here.

function process

(xtable)=>
// compare each row's Custom list against all lists in column Custom, and merge those that overlap
let
    Source = Table.Buffer(xtable),
    #"Added Custom" = Table.AddColumn(
        Source,
        "Custom2",
        each List.Accumulate(
            Source[Custom],
            [Custom],
            (state, current) => if List.ContainsAny(state, current) then List.Distinct(List.Combine({current, state})) else state
        )
    ),
    // count the number of changes from the previous version; if it is not zero, recurse
    x = List.Sum(List.Transform(List.Positions(#"Added Custom"[Custom]), each if #"Added Custom"[Custom]{_} = #"Added Custom"[Custom2]{_} then 0 else 1)),
    RemovePrioCustom = Table.RemoveColumns(#"Added Custom", {"Custom"}),
    AddNewCustom = Table.RenameColumns(RemovePrioCustom, {{"Custom2", "Custom"}}),
    recursive = if x = 0 then AddNewCustom else @process(AddNewCustom)
in
    recursive
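
If it helps to see the mechanics, here is a hedged toy invocation (assuming the function above is saved as a query named process; the Index/Custom values below are made up rather than taken from a real Round1 output): rows whose index lists share any member converge to one combined list, and the recursion stops once a pass changes nothing.

let
    // Custom holds, for each row, a list of row indexes that overlap it
    toy = Table.FromRecords({
        [Index = 0, Custom = {0, 1}],
        [Index = 1, Custom = {1, 2}],
        [Index = 2, Custom = {2}],
        [Index = 3, Custom = {3}]
    }),
    merged = process(toy)
in
    // rows 0-2 converge to one combined list containing 0, 1 and 2 (order may vary); row 3 keeps {3}
    merged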

Query

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Start", type datetime}, {"End", type datetime}}),
    // "Group" is the grouping column; in the question's sample data this would be "Service"
    #"Grouped Rows" = Table.Group(#"Changed Type", {"Group"}, {{"data", each
        let
            #"AddIndex" = Table.AddIndexColumn(_, "Index", 0, 1, Int64.Type),
            // for each row, collect the indexes of the rows in the same group whose intervals overlap it
            #"Round1" = Table.AddColumn(#"AddIndex", "Custom", (i) => Table.SelectRows(#"AddIndex", each
                ([End] = null and (i[End] = null or [Start] <= i[End])) or // remove this line if you don't care about null dates, to speed things up
                ([Start] >= i[Start] and [Start] <= i[End]) or
                ([Start] <= i[Start] and [End] >= i[Start])
            )[Index]),
            // keep merging the overlap lists until they stop changing
            #"Round2" = process(#"Round1"),
            #"GetStart" = Table.AddColumn(#"Round2", "StartMin", each List.Min(List.Transform([Custom], each #"Round2"[Start]{_})), type datetime),
            #"GetEnd" = Table.AddColumn(#"GetStart", "EndMax", each
                let a = List.Transform([Custom], each #"GetStart"[End]{_})
                in if List.Count(a) - List.NonNullCount(a) > 0 then null else List.Max(a)),
            #"RemoveColumns" = Table.RemoveColumns(#"GetEnd", {"Start", "End", "Index", "Custom"}),
            #"CleanUp" = Table.Distinct(#"RemoveColumns", {"Group", "StartMin", "EndMax"})
        in #"CleanUp", type table}}),
    // "Event ID" is not in the question's sample table; adjust the expanded columns to match your own data
    #"Expanded data" = Table.ExpandTableColumn(#"Grouped Rows", "data", {"Event ID", "StartMin", "EndMax"}, {"Event ID", "StartMin", "EndMax"})
in
    #"Expanded data"
  • I wish I could upvote it twice. Thank you sir, may the deity of choice bless you with multiple pleasures. One more question, what OVERLAP function are you referring to? I'm keen to explore more options with the dashboard I'm working on. – Michal Rosa May 13 '23 at 00:12
  • How much faster was the new version? As for OVERLAP, I'm a PQ, not Power BI, kind of guy; perhaps someone else will help out – horseyride May 13 '23 at 00:19
  • The previous version never actually managed to fully load and this one takes about 5 minutes, so if my maths is correct it's actually infinitely faster! One note: it doesn't affect my data, but I spotted during testing that if a given event has no end date (it's NULL), the output table removes the start time as well. As my outages will always have an end date it doesn't affect me, but it might be something that could be fixed in a future version. Thanks again, much appreciated! – Michal Rosa May 13 '23 at 00:25
  • If it matters, I think the edit above handles nulls – horseyride May 13 '23 at 18:23