kdb - KDB Apply logic where column exists - data validation

Question

I'm trying to perform some simple logic on a table but I'd like to verify that the columns exists prior to doing so as a validation step. My data consists of standard table names though they are not always present in each data source.

While the following seems to work (just validating AAA at present) I need to expand to ensure that PRI_AAA (and eventually many other variables) is present as well.

t: $[`AAA in cols `t; temp: update AAA_VAL: AAA*AAA_PRICE from t;()]

Two part question

This seems quite tedious for each variable (imagine AAA-ZZZ inputs and their derivatives). Is there a clever way to leverage a dictionary (or table) to see if a number of variables exists or insert a place holder column of zeros if they do not?
Similarly, can we store a formula or instructions to to apply within a dictionary (or table) to validate and return a calculation (i.e. BBB_VAL: BBB*BBB_PRICE.) Some calculations would be dependent on others (i.e. BBB_Tax_Basis = BBB_VAL - BBB_COSTS costs for example so there could be iterative issues.

Thank in advance!

Cathal O'Neill · Answer 1 · 2020-11-19T01:05:16.043

A functional update may be the best way to achieve this if your intention is to update many columns of a table in a similar fashion.

func:{[t;x]
  if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
  :$[x in cols t;
    ![t;();0b;(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")];
    t;
   ];
 };

This function will update t with *_VAL columns for any column you pass as an argument, while first also adding a zero column for any missing columns passed as an argument.

q)t:([]AAA:10?100;BBB:10?100;CCC:10?100;AAA_PRICE:10*10?10;BBB_PRICE:10*10?10;CCC_PRICE:10*10?10;DDD_PRICE:10*10?10)
q)func/[t;`AAA`BBB`CCC`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL CCC_VAL DDD DDD_VAL
---------------------------------------------------------------------------------------
70  28  89  10        90        0         0         700     2520    0       0   0
39  17  97  50        90        40        10        1950    1530    3880    0   0
76  11  11  0         0         50        10        0       0       550     0   0
26  55  99  20        60        80        90        520     3300    7920    0   0
91  51  3   30        20        0         60        2730    1020    0       0   0
83  81  7   70        60        40        90        5810    4860    280     0   0
76  68  98  40        80        90        70        3040    5440    8820    0   0
88  96  30  70        0         80        80        6160    0       2400    0   0
4   61  2   70        90        0         40        280     5490    0       0   0
56  70  15  0         50        30        30        0       3500    450     0   0

As you've already mentioned, to cover point 2, a dictionary of functions might be the best way to go.

q)dict:raze{(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")}each`AAA`BBB`DDD
q)dict
AAA_VAL| * `AAA `AAA_PRICE
BBB_VAL| * `BBB `BBB_PRICE
DDD_VAL| * `DDD `DDD_PRICE

And then a slightly modified function...

func:{[dict;t;x]
  if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
  :$[x in cols t;
    ![t;();0b;(enlist`$string[x],"_VAL")!enlist(dict`$string[x],"_VAL")];
    t;
   ];
 };

yields a similar result.

q)func[dict]/[t;`AAA`BBB`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL DDD DDD_VAL
-------------------------------------------------------------------------------
70  28  89  10        90        0         0         700     2520    0   0
39  17  97  50        90        40        10        1950    1530    0   0
76  11  11  0         0         50        10        0       0       0   0
26  55  99  20        60        80        90        520     3300    0   0
91  51  3   30        20        0         60        2730    1020    0   0
83  81  7   70        60        40        90        5810    4860    0   0
76  68  98  40        80        90        70        3040    5440    0   0
88  96  30  70        0         80        80        6160    0       0   0
4   61  2   70        90        0         40        280     5490    0   0
56  70  15  0         50        30        30        0       3500    0   0

Thanks once again Cathal....at present I'm writing nested conditionals to check and see if both variables exist and while it works, its quite tedious and returns table `t` if either variable do not exist: `t: $[`AAA in cols `t; $[`AAA_PRICE in cols `t; t: update AAA_Revenue: AAA_Revenue + (AAA*AAA_PRICE) from t; t];t];` for each variable. The only concern I have is for dependent calculations though I assume each function and dictionary can be discrete (i.e. tax dictionary dependent on revenue dictionary, etc.) Thank you again, I'm going to test this this morning! — Jason_L, Nov 17 '20 at 14:06

terrylynch · Accepted Answer · 2020-11-19T15:50:47.827

1

Here's another approach which handles dependent/cascading calculations and also figures out which calculations are possible or not depending on the available columns in the table.

q)show map:`AAA_VAL`BBB_VAL`AAA_RevenueP`AAA_RevenueM`BBB_Other!((*;`AAA;`AAA_PRICE);(*;`BBB;`BBB_PRICE);(+;`AAA_Revenue;`AAA_VAL);(%;`AAA_RevenueP;1e6);(reciprocal;`BBB_VAL));
AAA_VAL     | (*;`AAA;`AAA_PRICE)
BBB_VAL     | (*;`BBB;`BBB_PRICE)
AAA_RevenueP| (+;`AAA_Revenue;`AAA_VAL)
AAA_RevenueM| (%;`AAA_RevenueP;1000000f)
BBB_Other   | (%:;`BBB_VAL)

func:{c:{$[0h=type y;.z.s[x]each y;-11h<>type y;y;y in key x;.z.s[x]each x y;y]}[y]''[y];
    ![x;();0b;where[{all in[;cols x]r where -11h=type each r:(raze/)y}[x]each c]#c]};

q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6);  
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB AAA_VAL AAA_RevenueP AAA_RevenueM
---------------------------------------------------------------
1   1         10          4   1       11           1.1e-05
2   2         20          5   4       24           2.4e-05
3   3         30          6   9       39           3.9e-05


/if the right columns are there
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6;BBB_PRICE:4 5 6f);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB BBB_PRICE AAA_VAL BBB_VAL AAA_RevenueP AAA_RevenueM BBB_Other
--------------------------------------------------------------------------------------------
1   1         10          4   4         1       16      11           1.1e-05      0.0625
2   2         20          5   5         4       25      24           2.4e-05      0.04
3   3         30          6   6         9       36      39           3.9e-05      0.02777778

The only caveat is that your map can't have the same column name as both the key and in the value of your map, aka cannot re-use column names. And it's assumed all symbols in your map are column names (not global variables) though it could be extended to cover that

EDIT: if you have a large number of column maps then it will be easier to define it in a more vertical fashion like so:

map:(!). flip(
 (`AAA_VAL;     (*;`AAA;`AAA_PRICE));
 (`BBB_VAL;     (*;`BBB;`BBB_PRICE));
 (`AAA_RevenueP;(+;`AAA_Revenue;`AAA_VAL));
 (`AAA_RevenueM;(%;`AAA_RevenueP;1e6));
 (`BBB_Other;   (reciprocal;`BBB_VAL))
 );

edited Nov 19 '20 at 15:50

answered Nov 18 '20 at 14:25

terrylynch

11,844
13
21

Thank you Terry, I need to wrap my head around this function. I'm still not quite there in terms understanding q syntax. I believe you understand my goal. This `map` dictionary could wind up becoming quite large but the benefit would be a single repository for these basic cashflow calculations. For reference, I have ~100 lines of nested conditional updates for over 6.5mm records and takes ~1000ms. I'll test today and report back if I can get it all working.**EDIT** - I'm not using any global variables either – Jason_L Nov 19 '20 at 15:43
1

Ok. See my additional edit above regarding a large number of mappings – terrylynch Nov 19 '20 at 15:51
Oh great, that will make it much easier to build! – Jason_L Nov 19 '20 at 15:54
Terry, following up to close the loop. I did get an abbreviated dictionary of this working so thank you very much. I'll build out more cascading dictionary items to further test and see any limitations. For reference, all my data are float for reference. I'm going to break down your function to better grasp the lambda variables and the .z.s self reference. – Jason_L Nov 19 '20 at 23:32
1

Good stuff. The recursive `.z.s` logic is scanning the map for nested mappings and then replacing them with their underlying calculations until everything in the map is written in terms of the base calculations on the actual/existing table columns. – terrylynch Nov 20 '20 at 10:06
Terry, if we wanted to add additional variables how would you recommend proceeding? Assuming there would be more than 2 variables per calculation? For example I'm creating a `TotalRevenue` Equivalent. Something akin to `(`TotalRevenue; (+;`AAA;`BBB;`CCC;`DDD`EEE.....etc));` This of course assumes that all variables exist otherwise it returns nothing. Thx in advance! – Jason_L Dec 08 '20 at 20:21
1

Look at how parse reads it and work off of that: `0N!parse"select totalR:AAA+BBB+CCC+DDD+EEE from tab";`. In this case of nested sums you could generate the nesting using something like ```(1#`TotalRevenue)!enlist{(+;y;x)}/[reverse`AAA`BBB`CCC`DDD`EEE]``` – terrylynch Dec 09 '20 at 10:18
Thanks again Terry, I've seen some a few examples of functional form and have written a few bits of test code to test `parse` but I hadn't used it in anger yet. Very helpful once again! – Jason_L Dec 09 '20 at 14:46

kdb - KDB Apply logic where column exists - data validation

2 Answers2