I receive a daily file that forecasts values for given categories. Where FileDate = FcstDate, the value FcstVal is actually the real, actual value. Right now I'm using Excel Power Query (XL'16: Get & Transform) to easily pull dozens of files together into a table that resembles the one below (400k+ rows, 18 levels in reality).
I need to be able to say: on 1-1 and 1-2, category AA|AC|null was forecast to be 60 and 44 respectively for 1-3, but the actual value turned out to be 43. Ditto for every other row. Most, but not all, unique row combinations are common between files. Eventually I'll have to worry about dealing with renamed levels...
The Table.Partition, Table.FillUp, Table.FromPartitions Power Query functions express the logic perfectly, but Power Query is far too slow because it seems to read each very large .xlsx file multiple times (+1x per row?!), made worse because I'd need an index table with all distinct Category Levels & Forecast Dates to partition on.
Right now I'm reduced to using this formula in an excel table:
=SUMIFS([ActualVal], [Lvl1],[@[Lvl1]], [Lvl2],[@[Lvl2]], [Lvl3],[@[Lvl3]], [FileDt],[@[FcstDt]], [Eq],"Y")
However, this requires setting all blanks to the literal text 'null', escaping values that start with "=" or ">", etc., and it takes hours to calculate.
I've been trying to learn PowerPivot/DAX because I understand it's able to efficiently filter & calculate large data sets. I'm hoping for a solution that will set the 'context' of a DAX calculation to the same row I reference via the old-fashioned excel formula & move the value into my 'wanted' column - but I haven't figured it out.
I'd very much prefer a PowerPivot solution if possible, but if not, I can sometimes make sense of python/pandas. However, we're stuck with Excel input from a 3rd party provider.
Lvl1 | Lvl2 | Lvl3   | FileDt | FcstDt | Eq | FcstVal | ActualVal | Wanted!
1-1 file: ________________________________________________________________
AA   | AB   | AD     | 1-1    | 1-1    | Y  | 100     | 100       | 100
AA   | AC   | AE     | 1-1    | 1-1    | Y  | 50      | 50        | 50
AA   | AB   | (null) | 1-1    | 1-2    |    | 110     |           | 105
AA   | AC   | (null) | 1-1    | 1-2    |    | (null)  |           | 45
AA   | AB   | (null) | 1-1    | 1-3    |    | 120     |           | 105
AA   | AC   | (null) | 1-1    | 1-3    |    | 70      |           | 43
1-2 file: ________________________________________________________________
AA   | AB   | (null) | 1-2    | 1-2    | Y  | 105     | 105       | 105
AA   | AC   | (null) | 1-2    | 1-2    | Y  | 45      | 45        | 45
AA   | AB   | (null) | 1-2    | 1-3    |    | 113     |           | (null)
AA   | AC   | (null) | 1-2    | 1-3    |    | 44      |           | 43
1-3 file: ________________________________________________________________
(missing row AA|AB!)          | 1-3    | 1-3    | Y  | (null) | (null) | (null)
AA   | AC   | (null) | 1-3    | 1-3    | Y  | 43      | 43        | 43
AA   | AB   | (null) | 1-3    | 1-4    |    | 108     |           | (null)
AA   | AC   | (null) | 1-3    | 1-4    |    | 42      |           | (null)
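If pandas ends up being the fallback, the core lookup is just a self-merge: take the rows where FileDt = FcstDt (the actuals) and join them back onto every forecast row with the same category levels and forecast date. A minimal sketch using the column names from the table above (a subset of the AA|AC rows, not the full pipeline):

```python
import pandas as pd

# Toy frame following the AA|AC rows of the sample table above.
df = pd.DataFrame({
    "Lvl1":      ["AA", "AA", "AA", "AA"],
    "Lvl2":      ["AC", "AC", "AC", "AC"],
    "Lvl3":      [None, None, None, None],
    "FileDt":    ["1-1", "1-2", "1-2", "1-3"],
    "FcstDt":    ["1-2", "1-2", "1-3", "1-3"],
    "FcstVal":   [None, 45, 44, 43],
    "ActualVal": [None, 45, None, 43],
})

cats = ["Lvl1", "Lvl2", "Lvl3"]

# Rows where the file date equals the forecast date carry the real value.
actuals = (df.loc[df["FileDt"] == df["FcstDt"], cats + ["FcstDt", "ActualVal"]]
             .rename(columns={"ActualVal": "Wanted"}))

# Broadcast that value onto every forecast of the same category/date.
# Note: pandas matches null join keys to each other (Lvl3 here), so the
# "replace blanks with the literal text null" workaround isn't needed.
out = df.merge(actuals, on=cats + ["FcstDt"], how="left")
print(out["Wanted"].tolist())
```

This assumes each (category levels, FcstDt) combination has at most one actual row; duplicate actuals would fan the merge out.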
EDIT:
I'll share my code because some parts might be useful to others, and my problem might be in the other parts.
My strategy is to load a set of workbooks as listed in a table in the open workbook called Sources. I apply a simple function to extract the table I want from each workbook's contents, and then apply another function to do as much processing as possible on the tables while they're still separate, thinking that multithreading may be better leveraged because they're still independent (is that right?).
This ends the first query: Workbooks. I would prefer to stop here and use PowerPivot if it can do the rest (with a final Table.Combine if necessary).
In Power Query I then have to combine the tables - twice. The first has all fields, while the second is a Distinct set of grouping fields from all tables (without the value or As-of Date fields). A single (i.e. first) table cannot be used because grouping combinations may exist in later tables that aren't in the first, & vice versa. This Distinct Table gets an index.
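For comparison, that distinct-combinations-plus-index step has a direct pandas analogue. A sketch with two hypothetical per-file frames (names assumed); note the second file contributes a grouping the first lacks, which is exactly why no single file can serve as the master list:

```python
import pandas as pd

# Two hypothetical per-file frames; the second has a grouping (AA|AD)
# that the first lacks.
frames = [
    pd.DataFrame({"Lvl1": ["AA", "AA"], "Lvl2": ["AB", "AC"], "Forecast": [110, 45]}),
    pd.DataFrame({"Lvl1": ["AA", "AA"], "Lvl2": ["AC", "AD"], "Forecast": [44, 9]}),
]
cats = ["Lvl1", "Lvl2"]

# Distinct grouping combinations across ALL files, with a surrogate index.
index = (pd.concat(f[cats] for f in frames)
           .drop_duplicates()
           .reset_index(drop=True)
           .rename_axis("Index")
           .reset_index())
print(index)
```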
I join the second to the first via Table.NestedJoin & extract only the index from the joined column. This allows me to divide the data into partitions that each contain only one combination of Forecast Date & grouping fields. Here I can FillDown because the tables were pre-sorted in descending order in the Prep_Data_Table function, so the Actual value, if any, flows down to the other rows of the same group and no further.
After this, simply recombine the tables.
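In pandas, the whole partition → FillDown → recombine sequence above collapses to a single grouped transform. A sketch (column names assumed, and it relies on each group holding at most one actual):

```python
import pandas as pd

# One (category, forecast date) group spread across three as-of dates;
# only the 1-3 file (where as-of equals the forecast date) has the actual.
df = pd.DataFrame({
    "Lvl1":     ["AA", "AA", "AA"],
    "Lvl2":     ["AC", "AC", "AC"],
    "FcstDt":   ["1-3", "1-3", "1-3"],
    "AsOfDt":   ["1-1", "1-2", "1-3"],
    "Forecast": [70, 44, 43],
    "Actual":   [None, None, 43],
})

keys = ["Lvl1", "Lvl2", "FcstDt"]

# 'first' skips nulls within each group, so the lone Actual is broadcast
# to every row of its partition -- no sorting or FillDown required.
# dropna=False keeps groups whose key columns contain nulls (e.g. Lvl3).
df["Wanted"] = df.groupby(keys, dropna=False)["Actual"].transform("first")
print(df["Wanted"].tolist())
```

No index table or explicit partitioning is needed because groupby does the partitioning internally.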
CODE:
FieldMetadata holds data type & ordering information for the fields. Sources holds the pathnames & whether or not to load the specified file.
ImportParameters:
[ThisWB = Excel.CurrentWorkbook(),
Sources = ThisWB{[Name="Sources"]}[Content],
FieldMetadata = ThisWB{[Name="FieldMetadata"]}[Content],
FieldTypes = Table.ToRows(GetCfg({"Type"})),
CategoryFields = List.Buffer(List.Transform(List.Select(List.Transform(FieldTypes, each {List.First(_), TypeFromString(List.Last(_))}), each List.Last(_) = type text), each List.First(_))),
CategoryFieldTypes = List.Buffer(List.Transform(FieldTypes, (_) => {List.First(_), TypeFromString(List.Last(_))}))]
GetCfg:
let
Cfg = (Columns as list) as table =>
let
FilterList = List.Transform(Columns, each "[" & _ & "] <> null"),
ExpressionText = Text.Combine(FilterList, " and "),
Source = Excel.CurrentWorkbook(){[Name="FieldMetadata"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source, {{"Field", type text}, {"Type", type text}, {"Grouping", Int32.Type}, {"Presentation", Int32.Type}}),
Custom1 = Table.SelectColumns(#"Changed Type", List.Combine({{"Field"}, Columns})),
#"Filtered Rows" = Table.SelectRows(Custom1, each Expression.Evaluate(ExpressionText, [_=_]))
/* The above line is a bit of a mind bender. It lets me apply filters without hard-coding column names. Very useful.
Credit to http://www.thebiccountant.com/2016/03/08/select-rows-that-have-no-empty-fields-using-expression-evaluate-in-power-bi-and-power-query/
*/
in
#"Filtered Rows"
in
Cfg
FieldSortOrder:
let
SortOn = (SortOn as text) as list =>
let
Source = ImportParameters[FieldMetadata],
#"Changed Type" = Table.TransformColumnTypes(Source, {{"Field", type text}, {"Grouping", type number}}),
SelectedSort = Table.SelectColumns(#"Changed Type", {"Field", SortOn}),
RenamedSortColumn = Table.RenameColumns(SelectedSort, {{SortOn, "Sort"}}),
NoNulls = Table.SelectRows(RenamedSortColumn, each ([Sort] <> null)),
SortedFields = Table.Sort(NoNulls, {{"Sort", Order.Ascending}})[Field]
in
SortedFields
in
SortOn
TypeFromString:
let
Type = (TypeName as text) as type =>
let
TypeNameFix = if TypeName = "Table" then "_Table" else TypeName, // because Table is a reserved word
TypR = [Any=Any.Type,
Binary=Binary.Type, // The whole list of types I could find.
...
_Table=Table.Type,
...
WebMethod=WebMethod.Type],
TheType = try Record.Field(TypR, TypeNameFix) otherwise error [Reason="TypeName not found", Message="Parameter was not found among the list of types defined within the TypeFromString function."]
in
TheType
in
Type
Extract_Data_Table:
let
Source = (Table as table) as table =>
let
#"Filtered Rows" = Table.SelectRows(Table, each ([Kind] = "Table" and ([Item] = "Report Data" or [Item] = "Report_Data"))),
#"Select Columns" = Table.SelectColumns(#"Filtered Rows", "Data"),
DataTable = #"Select Columns"[Data]{0}
in
DataTable
in
Source
Prep_Data_Table:
let
PrepParams = (HorizonEndParam as date, CategoryFieldTypesParam as list) as function =>
let
// Parameters renamed: a let binding like HorizonEnd = HorizonEnd refers to itself (a cyclic reference in M), not to the parameter.
HorizonEnd = HorizonEndParam,
CategoryFieldTypes = List.Buffer(CategoryFieldTypesParam),
Source = (InputTable as table, FileDate as date) as table =>
let
EndFields = {"As-of Date", "PERIOD", "Actual", "Forecast"} as list,
PeriodsAsDates = Table.TransformColumnTypes(InputTable, {{"PERIOD", type date}}),
#"Remove Errors" = Table.RemoveRowsWithErrors(PeriodsAsDates, {"PERIOD"}),
WithinHorizon = Table.SelectRows(#"Remove Errors", each ([PERIOD] <= HorizonEnd)),
RenamedVAL = Table.RenameColumns(WithinHorizon, {"VAL", "Forecast"}), // Forecast was originally named VAL
MovedActual = Table.AddColumn(RenamedVAL, "Actual", each if [PERIOD]=FileDate then (if [Forecast] is null then 0 else [Forecast]) else null),
IncludeAsOfDate = Table.AddColumn(MovedActual, "As-of Date", each FileDate, Date.Type),
AppliedCategoryFieldTypes = Table.TransformColumnTypes(IncludeAsOfDate, CategoryFieldTypes),
TransformedColumns = Table.TransformColumns(AppliedCategoryFieldTypes, {{"{Values}", Text.Trim, type text}, {"Actual", Number.Abs, Currency.Type}, {"Forecast", Number.Abs, Currency.Type}}),
Sorted = Table.Sort(TransformedColumns, {{"Actual", Order.Descending}}), // Descending order is important because Table.FillDown is more efficient than Table.FillUp
OutputTable = Table.SelectColumns(Sorted, List.Distinct(List.Combine({List.Transform(CategoryFieldTypes, each List.First(_)), EndFields}))),
Output = OutputTable
in
Output
in
Source
in
PrepParams
Workbooks:
let
// Import Data
Source = ImportParameters[Sources],
#"Changed Type" = Table.TransformColumnTypes(Source, {{"As-of Date", type date}, {"Folder Path", type text}, {"Tab", type text}, {"Load", type logical}}),
#"Filtered Rows"= Table.SelectRows(#"Changed Type", each ([Load] = true)),
WorkbookPaths = Table.AddColumn(#"Filtered Rows", "File Path", each [Folder Path] & [File], type text),
LoadWorkbooks = Table.AddColumn(WorkbookPaths, "Data", each Excel.Workbook(File.Contents([File Path])) meta [#"As-of Date" = [#"As-of Date"]]),
LoadDataTables = Table.TransformColumns(LoadWorkbooks, {"Data", each Extract_Data_Table(_) meta [#"As-of Date" = Value.Metadata(_)[#"As-of Date"]]}),
PrepFunc = Prep_Data_Table(List.Max(LoadDataTables[#"As-of Date"]), ImportParameters[CategoryFieldTypes]),
// This TransformColumns step references the column's list, not the table, so the As-of Date field of the column is out of scope. Use metadata to bring the As-of Date value into the same scope
PrepDataTables = Table.TransformColumns(LoadDataTables, {"Data", each Table.Buffer(PrepFunc(_, Value.Metadata(_)[#"As-of Date"]))}),
Output = Table.SelectColumns(PrepDataTables, {"Data", "As-of Date"})
in
Output
MakeComparison:
let
CategoryFields = ImportParameters[CategoryFields],
DataTableList = Workbooks[Data],
CategoryIndex = Table.AddIndexColumn(Table.Distinct(Table.Combine(List.Transform(DataTableList, each Table.SelectColumns(_, CategoryFields)))), "Index"),
ListOfDataTablesWithNestedIndexTable = List.Transform(DataTableList, each Table.NestedJoin(_, CategoryFields, CategoryIndex, CategoryFields, "Index", JoinKind.Inner)),
ListOfIndexedDataTables = List.Transform(ListOfDataTablesWithNestedIndexTable, each Table.TransformColumns(_, {"Index", each List.Single(Table.Column(_, "Index")) as number, type number})),
Appended = Table.Combine(ListOfIndexedDataTables),
Merged = Table.Join(CategoryIndex, "Index", Table.SelectColumns(Appended, {"As-of Date", "Actual", "Forecast", "Index"}), "Index"),
Partitioned = Table.Partition(Merged, "Index", Table.RowCount(CategoryIndex), each _),
CopiedActuals = List.Transform(Partitioned, each Table.FillDown(_, {"Actual"})),
ToUnpartition = List.Transform(CopiedActuals, each {List.First(_[Index]), Table.RemoveColumns(_, {"Index"})}),
UnPartitioned = Table.FromPartitions("Index", ToUnpartition, type number),
Output = UnPartitioned
in
Output
Question: Does the function returned by Prep_Data_Table qualify as a closure?
Question: Does it matter whether I use Table.FromPartitions or simply Table.Combine to recombine the Tables? What's the difference?
Question: What does Fast Data Load really do? When does it / does it not make a difference?
Question: Is there any performance benefit to specifying the type of everything (x as table, y as list, z as number, etc.)?
Question: I read in some documentation that let..in is just syntactic sugar around records. I've begun to prefer records because all intermediate values are available. Any performance implications?
Question: Any difference between number types? Int32.Type vs Int64.Type?