
I receive a daily file that forecasts values for given categories. Where FileDate = FcstDate, the value FcstVal is actually the real, actual value. Right now I'm using Excel Power Query (XL'16: Get & Transform) to easily pull dozens of files together into a table that resembles the one below (400k+ rows, 18 levels in reality).

I need to be able to say: on 1-1 and 1-2, Category AA|AC|null was forecast to be 70 and 44 respectively for 1-3, but the actual value turned out to be 43. Ditto for every other row. Most, but not all, unique row combinations are common between files. Eventually I'll have to worry about dealing with renamed levels...

The Table.Partition, Table.FillUp, Table.FromPartitions Power Query functions express the logic perfectly, but Power Query is far too slow because it seems to read each very large .xlsx file multiple times (+1x per row?!), made worse because I'd need an index table with all distinct Category Levels & Forecast Dates to partition on.
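
In outline, the partition-and-fill pattern looks like this (a sketch only; Merged and GroupCount are placeholders for the combined table and the number of distinct Category/FcstDt groups):

let
    // assumes an Index column that numbers each Category/FcstDt group 0..GroupCount-1
    Partitioned = Table.Partition(Merged, "Index", GroupCount, each _),
    // within each partition, the actual value (if any) fills the forecast rows
    Filled = List.Transform(Partitioned, each Table.FillUp(_, {"ActualVal"})),
    Recombined = Table.Combine(Filled)
in
    Recombined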

Right now I'm reduced to using this formula in an Excel table: =SUMIFS([ActualVal], [Lvl1],[@[Lvl1]], [Lvl2],[@[Lvl2]], [Lvl3],[@[Lvl3]], [FileDt],[@[FcstDt]], [Eq],"Y"). However, this requires setting all blanks to 'null', changing values that start with "=" or ">", etc., and it takes hours to calculate.

I've been trying to learn PowerPivot/DAX because I understand it's able to efficiently filter & calculate large data sets. I'm hoping for a solution that will set the 'context' of a DAX calculation to the same row I reference via the old-fashioned Excel formula & move the value into my 'Wanted' column - but I haven't figured it out.

I'd very much prefer a PowerPivot solution if possible, but if not, I can sometimes make sense of python/pandas. However, we're stuck with Excel input from a 3rd party provider.

    Lvl1 | Lvl2 | Lvl3 | FileDt | FcstDt | Eq | FcstVal | ActualVal | Wanted!
1-1 file: ___________________________________________________________________
     AA    AB     AD      1-1       1-1    Y     100        100          100
     AA    AC     AE      1-1       1-1    Y      50         50           50
     AA    AB    (null)   1-1       1-2          110                     105
     AA    AC    (null)   1-1       1-2         (null)                    45
     AA    AB    (null)   1-1       1-3          120                     105
     AA    AC    (null)   1-1       1-3           70                      43
1-2 file: ___________________________________________________________________
     AA    AB    (null)   1-2       1-2    Y     105        105          105
     AA    AC    (null)   1-2       1-2    Y      45         45           45
     AA    AB    (null)   1-2       1-3          113                   (null)
     AA    AC    (null)   1-2       1-3           44                      43
1-3 file: ___________________________________________________________________
 (missing row AA|AB!)     1-3       1-3    Y    (null)    (null)       (null)
     AA    AC    (null)   1-3       1-3    Y      43         43           43
     AA    AB    (null)   1-3       1-4          108                   (null)
     AA    AC    (null)   1-3       1-4           42                   (null)

EDIT:

I'll share my code because some parts might be useful to others, and my problem might be in the other parts.

My strategy is to load a set of workbooks as listed in a table called Sources in the open workbook. I apply a simple function to extract the table I want from each workbook's contents, and then apply another function to do as much processing as possible on the tables while they're still separate, thinking that multithreading may be better leveraged because they're still independent (is that right?).

This ends the first query, Workbooks. I would prefer to stop here and use PowerPivot if it can do the rest (with a final Table.Combine if necessary).

In Power Query I then have to combine the tables - twice. The first has all fields, while the second is a Distinct set of grouping fields from all tables (without the value or As-of Date fields). A single (i.e. first) table cannot be used because grouping combinations may exist in later tables that aren't in the first, & vice versa. This Distinct Table gets an index.

I join the second to the first via Table.NestedJoin & extract only the index from the joined column. This allows me to divide the data into partitions that each contain only a single combination of Forecast Date & grouping fields. Here I can FillDown because the tables were pre-sorted in descending order by Actual in the Prep_Data_Table function, so the Actual Value, if any, flows down to the other rows of the same group, and no further.

After this, simply recombine the tables.

CODE:

FieldMetadata holds data type & ordering information for the fields. Sources holds the pathnames & whether or not to load the specified file.

ImportParameters:

[
    ThisWB = Excel.CurrentWorkbook(),
    Sources = ThisWB{[Name="Sources"]}[Content],
    FieldMetadata = ThisWB{[Name="FieldMetadata"]}[Content],
    FieldTypes = Table.ToRows(GetCfg({"Type"})),
    CategoryFields = List.Buffer(List.Transform(List.Select(List.Transform(FieldTypes, each {List.First(_), TypeFromString(List.Last(_))}), each List.Last(_) = type text), each List.First(_))),
    CategoryFieldTypes = List.Buffer(List.Transform(FieldTypes, (_) => {List.First(_), TypeFromString(List.Last(_))}))
]

GetCfg:

let
    Cfg = (Columns as list) as table =>
let
    FilterList = List.Transform(Columns, each "[" & _ & "] <> null"),
    ExpressionText = Text.Combine(FilterList, " and "),
    Source = Excel.CurrentWorkbook(){[Name="FieldMetadata"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Field", type text}, {"Type", type text}, {"Grouping", Int32.Type}, {"Presentation", Int32.Type}}),
    Custom1 = Table.SelectColumns(#"Changed Type", List.Combine({{"Field"}, Columns})),
    #"Filtered Rows" = Table.SelectRows(Custom1, each Expression.Evaluate(ExpressionText, [_=_]))
        /* The above line is a bit of a mind bender. It lets me apply filters without hard-coding column names. Very useful.
           Credit to http://www.thebiccountant.com/2016/03/08/select-rows-that-have-no-empty-fields-using-expression-evaluate-in-power-bi-and-power-query/
        */
in
    #"Filtered Rows"
in
    Cfg

FieldSortOrder:

let
    SortOn = (SortOn as text) as list =>
let
    Source = ImportParameters[FieldMetadata],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Field", type text}, {"Grouping", type number}}),
    SelectedSort = Table.SelectColumns(#"Changed Type", {"Field", SortOn}),
    RenamedSortColumn = Table.RenameColumns(SelectedSort, {{SortOn, "Sort"}}),
    NoNulls = Table.SelectRows(RenamedSortColumn, each ([Sort] <> null)),
    SortedFields = Table.Sort(NoNulls, {{"Sort", Order.Ascending}})[Field]
in
    SortedFields
in
    SortOn

TypeFromString:

let
    Type = (TypeName as text) as type =>
let
    TypeNameFix = if TypeName = "Table" then "_Table" else TypeName, // a field named Table would shadow the Table library inside the record below
TypR = [Any=Any.Type,
        Binary=Binary.Type, // The whole list of types I could find.
        ...
        _Table=Table.Type,
        ...
        WebMethod=WebMethod.Type],
    TheType = try Record.Field(TypR, TypeNameFix) otherwise error [Reason="TypeName not found", Message="Parameter was not found among the list of types defined within the TypeFromString function."]
in
    TheType
in
    Type

Extract_Data_Table:

let
    Source = (Table as table) as table =>
let
    #"Filtered Rows" = Table.SelectRows(Table, each ([Kind] = "Table" and ([Item] = "Report Data" or [Item] = "Report_Data"))),
    #"Select Columns" = Table.SelectColumns(#"Filtered Rows", "Data"),
    DataTable = #"Select Columns"[Data]{0}
in
    DataTable
in
    Source

Prep_Data_Table:

let
    PrepParams = (HorizonEnd as date, CategoryFieldTypes as list) as function =>
let
    BufferedCategoryFieldTypes = List.Buffer(CategoryFieldTypes), // buffer once so every per-file call reuses the list
    Source = (InputTable as table, FileDate as date) as table =>
let
    EndFields = {"As-of Date", "PERIOD", "Actual", "Forecast"} as list,
    PeriodsAsDates = Table.TransformColumnTypes(InputTable, {{"PERIOD", type date}}),
    #"Remove Errors" = Table.RemoveRowsWithErrors(PeriodsAsDates, {"PERIOD"}),
    WithinHorizon = Table.SelectRows(#"Remove Errors", each ([PERIOD] <= HorizonEnd)),
    RenamedVAL = Table.RenameColumns(WithinHorizon, {"VAL", "Forecast"}), // Forecast was originally named VAL
    MovedActual = Table.AddColumn(RenamedVAL, "Actual", each if [PERIOD]=FileDate then (if [Forecast] is null then 0 else [Forecast]) else null),
    IncludeAsOfDate = Table.AddColumn(MovedActual, "As-of Date", each FileDate, Date.Type),
    AppliedCategoryFieldTypes = Table.TransformColumnTypes(IncludeAsOfDate, BufferedCategoryFieldTypes),
    TransformedColumns = Table.TransformColumns(AppliedCategoryFieldTypes, {{"{Values}", Text.Trim, type text}, {"Actual", Number.Abs, Currency.Type}, {"Forecast", Number.Abs, Currency.Type}}),
    Sorted = Table.Sort(TransformedColumns, {{"Actual", Order.Descending}}), // Descending order is important because Table.FillDown is more efficient than Table.FillUp
    OutputTable = Table.SelectColumns(Sorted, List.Distinct(List.Combine({List.Transform(BufferedCategoryFieldTypes, each List.First(_)), EndFields}))),
    Output = OutputTable
in
    Output
in
    Source
in
    PrepParams

Workbooks:

let
// Import Data
    Source = ImportParameters[Sources],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"As-of Date", type date}, {"Folder Path", type text}, {"Tab", type text}, {"Load", type logical}}),
    #"Filtered Rows"= Table.SelectRows(#"Changed Type", each ([Load] = true)),
    WorkbookPaths = Table.AddColumn(#"Filtered Rows", "File Path", each [Folder Path] & [File], type text),
    LoadWorkbooks = Table.AddColumn(WorkbookPaths, "Data", each Excel.Workbook(File.Contents([File Path])) meta [#"As-of Date" = [#"As-of Date"]]),
    LoadDataTables = Table.TransformColumns(LoadWorkbooks, {"Data", each Extract_Data_Table(_) meta [#"As-of Date" = Value.Metadata(_)[#"As-of Date"]]}),
    PrepFunc = Prep_Data_Table(List.Max(LoadDataTables[#"As-of Date"]), ImportParameters[CategoryFieldTypes]),
    // This TransformColumns step references the column's list, not the table, so the As-of Date field of the column is out of scope. Use metadata to bring the As-of Date value into the same scope

    PrepDataTables = Table.TransformColumns(LoadDataTables, {"Data", each Table.Buffer(PrepFunc(_, Value.Metadata(_)[#"As-of Date"]))}),
    Output = Table.SelectColumns(PrepDataTables, {"Data", "As-of Date"})
in
    Output

MakeComparison:

let
    CategoryFields = ImportParameters[CategoryFields],
    DataTableList = Workbooks[Data],
    CategoryIndex = Table.AddIndexColumn(Table.Distinct(Table.Combine(List.Transform(DataTableList, each Table.SelectColumns(_, CategoryFields)))), "Index"),
    ListOfDataTablesWithNestedIndexTable = List.Transform(DataTableList, each Table.NestedJoin(_, CategoryFields, CategoryIndex, CategoryFields, "Index", JoinKind.Inner)),
    ListOfIndexedDataTables = List.Transform(ListOfDataTablesWithNestedIndexTable, each Table.TransformColumns(_, {"Index", each List.Single(Table.Column(_, "Index")) as number, type number})),
    Appended = Table.Combine(ListOfIndexedDataTables),
    Merged = Table.Join(CategoryIndex, "Index", Table.SelectColumns(Appended, {"As-of Date", "Actual", "Forecast", "Index"}), "Index"),
    Partitioned = Table.Partition(Merged, "Index", Table.RowCount(CategoryIndex), each _),
    CopiedActuals = List.Transform(Partitioned, each Table.FillDown(_, {"Actual"})),
    ToUnpartition = List.Transform(CopiedActuals, each {List.First(_[Index]), Table.RemoveColumns(_, {"Index"})}),
    UnPartitioned = Table.FromPartitions("Index", ToUnpartition, type number),
    Output = UnPartitioned
in
    Output

Question: Does Prep_Data_Table (a function that returns a function) qualify as a closure?

Question: Does it matter whether I use Table.FromPartitions or simply Table.Combine to recombine the Tables? What's the difference?
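
For concreteness, the two recombination calls I'm weighing look like this (a sketch; Table0 and Table1 stand for two partitions):

let
    // Table.Combine simply appends the partitions
    Appended = Table.Combine({Table0, Table1}),
    // Table.FromPartitions appends them and re-adds the partition key as a column
    WithKey = Table.FromPartitions("Index", {{0, Table0}, {1, Table1}}, type number)
in
    WithKey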

Question: What does Fast Data Load really do? When does it / does it not make a difference?

Question: Is there any performance benefit to specifying the type of everything (x as table, y as list, z as number, etc.)?

Question: I read in some documentation that let..in is just syntactic sugar around records. I've begun to prefer records because all intermediate values are available. Any performance implications?
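
For what it's worth, a minimal illustration of the equivalence as I understand it:

// this let expression...
let
    a = 1,
    b = a + 1
in
    b

// ...can be rewritten as a record, where every intermediate value stays addressable:
[
    a = 1,
    b = a + 1
][b]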

Question: Any difference between number types? Int32.Type vs Int64.Type?

alazyworkaholic
  • Hello. If the question is still valid, could you please share some (perhaps the largest) files and your code? Also, would you edit your post and describe the logic you implemented in your code? Some actions seem excessive to me. Please also state clearly (even step-by-step) the data transformations you want to do, e.g.: 1. filter out empty rows and invalid data; 2. apply another (what?) filter; 3. do something else; etc. Your question looks quite challenging! :) – Eugene Nov 30 '16 at 19:48

2 Answers


How large is very large for your XLSX files? I agree with your idea that it's likely we're opening the file once per row. Given XLSX is an archive format and each sheet is a big file, seeking within the file is going to be very slow.

Especially if the total is less than half your RAM and you're running 64-bit Office, you could improve Power Query performance dramatically by calling Table.Buffer on the tables coming from XLSX.
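
For example (a minimal sketch; the path and table name are placeholders for your source), buffering immediately after the workbook is opened:

let
    Source = Excel.Workbook(File.Contents("C:\Data\Forecast 1-1.xlsx")),
    ReportData = Source{[Item="Report Data", Kind="Table"]}[Data],
    // Table.Buffer materializes the table in memory so downstream steps
    // shouldn't trigger another read of the .xlsx archive
    Buffered = Table.Buffer(ReportData)
in
    Buffered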

Alternatively, if you could somehow convert your XLSX data to a CSV source, then you don't pay the price of cracking open XLSX files each time. Or if you could load the data into a source like SQL Server with column indexes, that should really speed up your query. (We generally "query fold" operations to SQL Server, which has much more powerful performance heuristics in its query engine than we've created in Power Query.) It's possible you could use the Power Pivot engine for this instead, but I'm not very familiar with that.
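
For instance, a CSV source can be read directly without the archive cost (a sketch; the path is a placeholder):

let
    Source = Csv.Document(File.Contents("C:\Data\Forecast 1-1.csv"), [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
    // the first row of the CSV becomes the column headers
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    Promoted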

Carl Walsh
  • I have plenty of RAM & tried Table.Buffer, but it seemed to have no effect at all. PQ loaded the ~20 MB of each file, but then pulled the files over & over until it told me it had downloaded over a gigabyte from each file overnight. I could see the data coming over the network monitor the whole time. I tried Table.Buffer on the Query that loads the workbooks, on each workbook that got loaded, and in desperation on just about every table object in my code at one point. I had assumed that Buffer would eagerly evaluate & freeze all data in the object in memory, but that certainly isn't happening – alazyworkaholic Oct 27 '16 at 17:03

One separate perf optimization: we've implemented Table.FillUp like this:

(table as table, columns as list) as table => Table.ReverseRows(Table.FillDown(Table.ReverseRows(table), columns))

which is pretty bad from a performance perspective. If you can perform the FillUp action once, save the data, then query over the new data, that will help query performance.
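
One way to avoid the double reverse entirely (a sketch; Partitions stands for a list of per-group tables, e.g. the output of Table.Partition): sort each group so its known value comes first, then fill forward once.

let
    // put the non-null Actual at the top of each group, then fill downward
    Filled = List.Transform(Partitions, each Table.FillDown(Table.Sort(_, {{"Actual", Order.Descending}}), {"Actual"})),
    Recombined = Table.Combine(Filled)
in
    Recombined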

Carl Walsh