
I am trying to fill down the missing values using Pentaho pdi.

Input:

[screenshot of the input data]

Desired output:

[screenshot of the desired output]

So far I have only found Filling data gaps in a stream in Pentaho Data Integration, is it possible?, but that solution fills in with the last known value.

Potentially, I could build on the above solution: I added the next amount to the Analytic Query step along with the next date, then added a flag in the Clone row step and filtered the original rows from the input into a Dummy step and the generated rows (from the Calculator) into another Calculator (for the moment). From there I could dump that separate stream to a temp table in a database and run a SQL query that does the rolling subtraction. I am also investigating the JavaScript step.

I disregarded the Python and R Executor steps because I will eventually run the job on an AWS VM, and I already foresee the pain of installing those runtimes there.

What would be your suggestions? Is there a simple way to do interpolation?

Update for the question:

[two additional screenshots]

eponkratova

1 Answer


The method in your link does work in my testing (though I am using LAG instead of LEAD for your task). I am not looking to replicate that method here; this is just another option for you, using JavaScript to build the logic, which you might also extend to other applications:

In my testing below (on PDI 8.0), the transformation has 5 steps, see below:

[screenshot of the transformation]

  1. Data Grid step to create test data with three fields: date, account number, and amount
  2. Sort rows step to sort the rows by account number and date. This is required for the Analytic Query step; if your source data are already sorted, skip this step
  3. Analytic Query step (see below) to create two more fields: prev_date and prev_amount

[screenshot of the Analytic Query step configuration]

  4. Modified Java Script Value step: add the following code; nothing else needs to be configured in this step:

    var days_diff = dateDiff(prev_date, date, "d");
    
    if (days_diff > 0) {
        /* retrieve index for two fields: 'date', 'amount' 
         * and modify their values accordingly
         */
        var idx_date = getInputRowMeta().indexOfValue("date");
        var idx_amount = getInputRowMeta().indexOfValue("amount");
    
        /* amount to increment by each row */
        var delta_amount = (amount - prev_amount) / days_diff;
    
        for (var i = 1; i < days_diff; i++) {
            /* copy the current row and overwrite date/amount */
            var newRow = createRowCopy(getOutputRowMeta().size());
            newRow[idx_date]   = dateAdd(prev_date, "d", i);
            newRow[idx_amount] = prev_amount + delta_amount * i;
            putRow(newRow);
        }
    }
    
  5. Select values step to remove the unwanted fields, i.e. prev_date and prev_amount

Run the transformation and you will see the following under the Preview data tab of the Modified Java Script Value step:

[screenshot of the preview results]
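The interpolation arithmetic in the script can be sanity-checked outside PDI with plain JavaScript. The sketch below is only an illustration: the PDI-specific calls (`createRowCopy`, `putRow`, `dateDiff`, `dateAdd`) are replaced with ordinary `Date` arithmetic, and the function name `fillGaps` is mine, not part of the transformation:

```javascript
// Standalone sketch of the gap-filling logic, runnable in Node.js.
// fillGaps() takes rows sorted by date and returns them together with
// linearly interpolated rows for each missing day in between.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function fillGaps(rows) {
  const out = [];
  let prev = null;
  for (const row of rows) {
    if (prev !== null) {
      const daysDiff = Math.round((row.date - prev.date) / MS_PER_DAY);
      // amount to increment by each generated row
      const delta = (row.amount - prev.amount) / daysDiff;
      // emit one interpolated row per missing day
      for (let i = 1; i < daysDiff; i++) {
        out.push({
          date: new Date(prev.date.getTime() + i * MS_PER_DAY),
          amount: prev.amount + delta * i,
        });
      }
    }
    out.push(row);
    prev = row;
  }
  return out;
}

// example: 100 on Jan 1 and 400 on Jan 4 get 200 and 300 in between
const filled = fillGaps([
  { date: new Date(2019, 0, 1), amount: 100 },
  { date: new Date(2019, 0, 4), amount: 400 },
]);
console.log(filled.map(r => r.amount)); // [ 100, 200, 300, 400 ]
```

This mirrors the PDI script's behavior: the boundary rows pass through unchanged and only the gap days are generated.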

UPDATE:

Per your comments, you can do the following, assuming you have a new field account_type:

  1. In the Analytic Query step, add a new field prev_account_type, similar to the two other prev_ fields but with a different Subject: account_type

  2. In the Modified Java Script Value step, retrieve the row index for account_type and modify the logic that computes delta_amount, so that when prev_account_type is not the same as the current account_type, delta_amount is ZERO; see the code below:

    var days_diff = dateDiff(prev_date, date, "d");
    
    if (days_diff > 0) {
        /* retrieve index for three fields: 'date', 'amount', 'account_type' */
        var idx_date     = getInputRowMeta().indexOfValue("date");
        var idx_amount   = getInputRowMeta().indexOfValue("amount");
        var idx_act_type = getInputRowMeta().indexOfValue("account_type");
    
        /* amount to increment by each row; zero when the account_type changes */
        var delta_amount = prev_account_type.equals(account_type) ? (amount - prev_amount) / days_diff : 0;
    
        /* copy the current row into newRow and modify fields accordingly */
        for (var i = 1; i < days_diff; i++) {
            var newRow = createRowCopy(getOutputRowMeta().size());
            newRow[idx_date]     = dateAdd(prev_date, "d", i);
            newRow[idx_amount]   = prev_amount + delta_amount * i;
            newRow[idx_act_type] = prev_account_type;
            putRow(newRow);
        }
    }
    

Note: invoking the JavaScript interpreter does incur a performance hit, so if that matters to you, stick to the method in the link you provided.
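The group-aware variant from the update can be sanity-checked the same way. Again a standalone sketch (the function name `fillGapsByType` and the example rows are mine, not from the transformation): when `account_type` changes across the gap, delta is zero, so the fill rows carry the previous amount and type forward instead of interpolating:

```javascript
// Standalone sketch of the group-aware variant, runnable in Node.js.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function fillGapsByType(rows) {
  const out = [];
  let prev = null;
  for (const row of rows) {
    if (prev !== null) {
      const daysDiff = Math.round((row.date - prev.date) / MS_PER_DAY);
      // zero delta when the type switches mid-gap, per the updated logic
      const delta = prev.account_type === row.account_type
        ? (row.amount - prev.amount) / daysDiff
        : 0;
      for (let i = 1; i < daysDiff; i++) {
        out.push({
          date: new Date(prev.date.getTime() + i * MS_PER_DAY),
          amount: prev.amount + delta * i,
          account_type: prev.account_type, // filled rows keep the old type
        });
      }
    }
    out.push(row);
    prev = row;
  }
  return out;
}

// example: type switch between the two rows, so the gap holds 8900 flat
const filled = fillGapsByType([
  { date: new Date(2019, 0, 1), amount: 8900,   account_type: 31 },
  { date: new Date(2019, 0, 4), amount: 117000, account_type: 134 },
]);
console.log(filled.map(r => r.amount)); // [ 8900, 8900, 8900, 117000 ]
```

This matches the requirement from the comments: between the old type's last date and the new type's first date, the old amount is held rather than ramped toward the new one.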

jxc
  • Hi @jxc, it is just perfect! Question, though: if I have a few columns I need to interpolate... if I am to add logic to your JS, do I use `&&`? – eponkratova May 05 '19 at 01:51
  • @eponkratova, I think for more columns, just retrieve their index using `getInputRowMeta().indexOfValue(COL_NAME)`, find the delta to increase on each newRow, then set their value with `newRow[idx] = ...`; all other fields are copied into newRow otherwise. There is no need to use `&&`. Make sure to set their prev_ values through the `Analytic Query` step. – jxc May 05 '19 at 02:21
  • Thank you, I will try and message you later. Just one more question, if I need to do the interpolation by account_number and product type - let's assume that I have one more column which says product type. How should I handle it? Add the product type along with the account number in the analytical query? How will your formula work for the account 0234051 > product type 31? – eponkratova May 05 '19 at 03:13
  • Continued - Initially, it had the 'start date' of 2/1/2019 and the amount of 8900, but then the account changed the product type to 134 on 4/5/2019 with the amount of 117000. Although 4/5/2019 is the 'end date', 117000 is not the end amount because it is a completely different product type for the same account. What should be populated instead between 2/1/2019 and 4/5/2019 is 8900. – eponkratova May 05 '19 at 03:13
  • @eponkratova, if you add both account_number and account_type into the `Group field` in the `Analytic Query` step, there will be no interpolation between 2/1/2019 and 4/5/2019 since they are in two different groups. What are your expected results? – jxc May 05 '19 at 03:32
  • In the example above, I would expect 8900 filled in every row between 2/1/2019 and 4/5/2019. Do I need to do something like: run your job, and for a case like the one I gave you there will be empty rows - if my understanding is correct - so I filter the resulting dataset into null and not null. Those that are null, I just fill down with the prev value. Then I append the streams. Does it make sense? – eponkratova May 05 '19 at 03:43
  • @eponkratova, this will need some adjustment on the JS code. If 1/2/2019 in the same account/type combo is 8000 (instead of 8900), you still want 8900 instead of extrapolation, correct? I will check it tomorrow. – jxc May 05 '19 at 03:58
  • 8900 is just an example. And yes, what I need is whatever amount was on 2/1/2019 - this example. It is similar to the simple filldown method. – eponkratova May 05 '19 at 04:08
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/192853/discussion-between-eponkratova-and-jxc). – eponkratova May 05 '19 at 04:46