
I have the following data:

+-----------+-----------+-----------+
|   start   |   stop    |  status   |
+-----------+-----------+-----------+
| 09:01:10  | 09:01:40  |  active   |
| 09:02:30  | 09:04:50  |  active   |
| 09:10:01  | 09:11:50  |  active   |
+-----------+-----------+-----------+

I want to fill in the gaps with "passive" rows, like this:

+-----------+-----------+-----------+
|   start   |   stop    |  status   |
+-----------+-----------+-----------+
| 09:01:10  | 09:01:40  |  active   |
| 09:01:40  | 09:02:30  |  passive  |
| 09:02:30  | 09:04:50  |  active   |
| 09:04:50  | 09:10:01  |  passive  |
| 09:10:01  | 09:11:50  |  active   |
+-----------+-----------+-----------+

How can I do this in the Power Query M language?

intrixius

3 Answers


You could try something like the below (my first two steps someTable and changedTypes are just to re-create your sample data on my end):

let
    // someTable and changedTypes just re-create the sample data
    someTable = Table.FromColumns({{"09:01:10", "09:02:30", "09:10:01"}, {"09:01:40", "09:04:50", "09:11:50"}, {"active", "active", "active"}}, {"start","stop","status"}),
    changedTypes = Table.TransformColumnTypes(someTable, {{"start", type duration}, {"stop", type duration}, {"status", type text}}),
    listOfRecords = Table.ToRecords(changedTypes),
    // walk the records in order; whenever a record's start differs from the previous record's stop,
    // insert an extra "passive" record covering the gap before appending the current record
    transformList = List.Accumulate(List.Skip(List.Positions(listOfRecords)), {listOfRecords{0}}, (listState, currentIndex) =>
        let
            previousRecord = listOfRecords{currentIndex-1},
            currentRecord = listOfRecords{currentIndex},
            thereIsAGap = currentRecord[start] <> previousRecord[stop],
            recordsToAdd = if thereIsAGap then {[start=previousRecord[stop], stop=currentRecord[start], status="passive"], currentRecord} else {currentRecord},
            append = listState & recordsToAdd
        in
            append
    ),
    backToTable = Table.FromRecords(transformList, type table [start=duration, stop=duration, status=text])
in
    backToTable

This is what I start off with (at the changedTypes step):

Input table

This is what I end up with:

Output table

To integrate with your existing M code, you'll probably need to:

  • remove someTable and changedTypes from my code (and replace with your existing query)
  • change changedTypes in the listOfRecords step to whatever your last step is called (otherwise you'll get an error if you don't have a changedTypes expression in your code); see the sketch below.
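
For illustration, here is a rough sketch of that wiring. The source step and the #"Changed Type" step name below are hypothetical placeholders (use whatever your own query actually contains); only the listOfRecords line changes compared to the code above:

// hypothetical integration sketch: "YourTable" and #"Changed Type" are placeholder names
let
    Source = Excel.CurrentWorkbook(){[Name = "YourTable"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"start", type duration}, {"stop", type duration}, {"status", type text}}),
    // everything below is unchanged from the answer, except that listOfRecords
    // now reads from #"Changed Type" instead of changedTypes
    listOfRecords = Table.ToRecords(#"Changed Type"),
    transformList = List.Accumulate(List.Skip(List.Positions(listOfRecords)), {listOfRecords{0}}, (listState, currentIndex) =>
        let
            previousRecord = listOfRecords{currentIndex-1},
            currentRecord = listOfRecords{currentIndex},
            thereIsAGap = currentRecord[start] <> previousRecord[stop],
            recordsToAdd = if thereIsAGap then {[start=previousRecord[stop], stop=currentRecord[start], status="passive"], currentRecord} else {currentRecord},
            append = listState & recordsToAdd
        in
            append
    ),
    backToTable = Table.FromRecords(transformList, type table [start=duration, stop=duration, status=text])
in
    backToTable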

Edit:

Further to my answer, I would suggest changing this line in the code above:

listOfRecords = Table.ToRecords(changedTypes),

to

listOfRecords = List.Buffer(Table.ToRecords(changedTypes)),

I found that storing the list in memory reduced my refresh time significantly (by maybe ~90%). I imagine there are limits and drawbacks (e.g. if the list can't fit in memory), but it might be okay for your use case.

Performance graphed

Do you experience similar behaviour? Unfortunately, my basic graph also indicates that the overall complexity of the code is non-linear.

Final note: I found that generating and processing 100k rows resulted in a stack overflow while refreshing the query (this might have been due to the generation of the input rows rather than the insertion of new rows; I don't know). So clearly, this approach has its limits.

chillin
  • Note that you need to be careful when shifting index values if your data isn't sorted correctly. – Alexis Olson Jan 22 '19 at 18:47
  • Yes, data shown (in question) appears to be sorted, so I've made that assumption. In the event that it's not, it would need to be sorted before filling in the gaps. – chillin Jan 22 '19 at 19:00
  • Any elaboration on the performance of this method when doing this on tens of thousands of records? – intrixius Jan 22 '19 at 19:23
  • @intrixius In theory, this approach looks at each row (of your table) only once, as it does everything in one pass. It should be O(N) and be as efficient as possible (in terms of complexity). In practice, however, I don't know if there are other more efficient methods, as it's possible the `table` to `records` to `table` conversion does not scale well (depends on Power Query's internal implementation). You could try it with your own data -- or using `List.Repeat` to effectively multiply the number of rows by 10k to simulate that scale. I would do this, but I'm posting from my phone. – chillin Jan 22 '19 at 20:07
  • I have tested this approach with 5000 records, but canceled it after 2 minutes when it was still running... when I only kept 500 records, it took about 7 seconds; 2000 rows was about 30 seconds... – intrixius Jan 23 '19 at 07:46
  • I have implemented this solution on my real-life dataset, and it is very slow... even for only 100 kept rows... it seems the execution time grows exponentially with the number of records... I will try to find a way to improve speed – intrixius Jan 23 '19 at 09:54
  • if anyone knows where I should look at to improve performance, be my guest ;) – intrixius Jan 23 '19 at 10:22
  • Nice. I will test it with my real data. I also wonder how you generated these charts; is there an easy way to run all these variations (number of records and method) and gather their execution times? – intrixius Jan 23 '19 at 19:23
  • @intrixius I generated `N` rows using Power Query, then refreshed the query programmatically 10 times using a loop in VBA (in Excel) whilst also using VBA to measure/record the refresh times for each loop iteration (I disabled background refresh for the query to ensure VBA caught the end of the refresh correctly). I then went back and increased `N` (from 1k, to 10k, to 20k, etc.). After that, I tabulated and computed averages and plotted data on graph (again in Excel). It wasn't super scientific/accurate/controlled conditions, but the results should definitely be indicative nonetheless. – chillin Jan 23 '19 at 19:52
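
As a rough illustration of the List.Repeat idea from the comments above, something like the following could be used to inflate the three sample rows purely for timing purposes (the repeated times overlap, so only the refresh time is meaningful, not the output):

// hypothetical benchmark helper; assumes the changedTypes step from the answer above is in scope
let
    sampleRows = Table.ToRecords(changedTypes),
    inflated = List.Repeat(sampleRows, 10000),    // 3 rows -> 30,000 rows, purely for volume
    bigTable = Table.FromRecords(inflated)        // point listOfRecords at this table to time the transform at scale
in
    bigTable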

I would approach this as follows:

  1. Duplicate the first table.
  2. Replace "active" with "passive".
  3. Remove the start column.
  4. Rename stop to start.
  5. Create a new stop column by looking up the earliest start time in your original table that occurs after the start time of the current row (i.e. after what was the stop time before the rename).
  6. Filter out nulls in this new column.
  7. Append this table to the original table.

The M code will look something like this:

let
    Source = <...your starting table...>,
    PassiveStatus = Table.ReplaceValue(Source,"active","passive",Replacer.ReplaceText,{"status"}),
    RemoveStart = Table.RemoveColumns(PassiveStatus,{"start"}),
    RenameStart = Table.RenameColumns(RemoveStart,{{"stop", "start"}}),
    AddStop = Table.AddColumn(RenameStart, "stop", (C) => List.Min(List.Select(Source[start], each _ > C[start])), type time),
    RemoveNulls = Table.SelectRows(AddStop, each ([stop] <> null)),
    CombineTables = Table.Combine({Source, RemoveNulls}),
    #"Sorted Rows" = Table.Sort(CombineTables,{{"start", Order.Ascending}})
in
    #"Sorted Rows"

The only tricky bit above is the custom column part where I define the new column like this:

(C) => List.Min(List.Select(Source[start], each _ > C[start]))

This takes each item in the column/list Source[start] and compares it to the time in the current row. It selects only the ones that occur after the time in the current row and then takes the min of that list to find the earliest one.
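
As a concrete standalone illustration (using the question's sample values as time literals; sourceStarts stands in for Source[start]), here is roughly how that expression evaluates for the duplicated row whose start is 09:01:40:

// standalone sketch of the lookup for one row
let
    sourceStarts = {#time(9, 1, 10), #time(9, 2, 30), #time(9, 10, 1)},
    currentStart = #time(9, 1, 40),                                  // C[start] for this row
    laterStarts = List.Select(sourceStarts, each _ > currentStart),  // {09:02:30, 09:10:01}
    newStop = List.Min(laterStarts)                                  // 09:02:30 becomes the new stop
in
    newStop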

Result

Alexis Olson
  • This also works as expected, thanks! Since chillin was a bit faster, I accepted his answer... Could you still elaborate on which of the two performs better? I need to do this for some tens of thousands of records... – intrixius Jan 22 '19 at 19:22
  • I'm guessing they'd both be fine for tens of thousands of records. His appears to have a lower computational complexity, but it's pretty hard to predict performance. Try them both and see how they handle. If they're both reasonable, then choose whichever would be easier for you to understand and maintain if you had to come back in a year and modify it. – Alexis Olson Jan 22 '19 at 19:38
  • I tested this approach with more data. I had a 20,000 record table; I forgot to add a "keep top rows" filter, so it ran with all rows... unfortunately it made Excel Power Query unresponsive for about one minute, and when it became responsive again it kept on loading... Afterwards I kept only 500 rows and it ran for 5 seconds; 2,000 rows ran for 30 seconds, but again, Excel became unresponsive for some time... – intrixius Jan 23 '19 at 07:53

I think I may have a better performing solution.

From your source table (assuming it's sorted), add an index column starting from 0 and another starting from 1, then merge the table with itself using a left outer join on the index columns and expand the start column.

Self Merge

Remove columns except for stop, status, and start.1 and filter out nulls.

Rename columns to start, status, and stop and replace "active" with "passive".

Finally, append this table to your original table.

let
    // Source should be your starting table (here it points at steps that set up the sample data)
    Source = Table.RenameColumns(#"Removed Columns",{{"Column1.2", "start"}, {"Column1.3", "stop"}, {"Column1.4", "status"}}),
    Add1Index = Table.AddIndexColumn(Source, "Index", 1, 1),
    Add0Index = Table.AddIndexColumn(Add1Index, "Index.1", 0, 1),
    // joining Index (1-based) to Index.1 (0-based) pairs each row with the row after it,
    // so the next row's start becomes available as the gap's stop
    SelfMerge = Table.NestedJoin(Add0Index,{"Index"},Add0Index,{"Index.1"},"Added Index1",JoinKind.LeftOuter),
    ExpandStart1 = Table.ExpandTableColumn(SelfMerge, "Added Index1", {"start"}, {"start.1"}),
    RemoveCols = Table.RemoveColumns(ExpandStart1,{"start", "Index", "Index.1"}),
    FilterNulls = Table.SelectRows(RemoveCols, each ([start.1] <> null)),
    RenameCols = Table.RenameColumns(FilterNulls,{{"stop", "start"}, {"start.1", "stop"}}),
    ActiveToPassive = Table.ReplaceValue(RenameCols,"active","passive",Replacer.ReplaceText,{"status"}),
    AppendQuery = Table.Combine({Source, ActiveToPassive}),
    #"Sorted Rows" = Table.Sort(AppendQuery,{{"start", Order.Ascending}})
in
    #"Sorted Rows"

This should be O(n) complexity, with similar logic to @chillin's, but I think it should be faster than using a custom function since it uses a built-in merge, which is likely to be highly optimized.

Alexis Olson
  • Nice approach. I thought of this yesterday, but I thought it would only help with the initial part of determining where to insert rows. Your point about the merge being highly optimised seems probable, and it will be interesting to see if OP can confirm this. Good thinking. – chillin Jan 23 '19 at 15:34
  • I've just tested this with a set of over half a million distinct start times and the query took under 30 seconds to load. And it can be done almost entirely through the GUI! – Alexis Olson Jan 23 '19 at 16:05
  • Very cool, seems like your solution is far more scalable and performant. OP should change accepted solution. – chillin Jan 23 '19 at 17:47
  • Let me also try this, and I will come back with my results and choose the best answer :-) – intrixius Jan 23 '19 at 19:24
  • This is perfect! It runs really smoothly. In the meantime, I have used this "Add Index(0) and Add Index(1)" principle multiple times for similar issues where I have to find the previous occurrence... – intrixius Jan 24 '19 at 20:56
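
For reference, a generic sketch of that previous-row lookup pattern (with placeholder table and column names) might look like the following: joining the 0-based index against the 1-based index pairs each row with the row before it.

// hypothetical example: bring each row's previous "value" alongside it
let
    Source = Table.FromRecords({[value = 10], [value = 20], [value = 30]}),
    Add1Index = Table.AddIndexColumn(Source, "Index", 1, 1),
    Add0Index = Table.AddIndexColumn(Add1Index, "Index.1", 0, 1),
    // left key Index.1 (0-based) against right key Index (1-based) matches each row with its predecessor
    SelfMerge = Table.NestedJoin(Add0Index, {"Index.1"}, Add0Index, {"Index"}, "Previous", JoinKind.LeftOuter),
    ExpandPrev = Table.ExpandTableColumn(SelfMerge, "Previous", {"value"}, {"previous value"}),   // first row gets null
    RemoveIndexes = Table.RemoveColumns(ExpandPrev, {"Index", "Index.1"})
in
    RemoveIndexes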