How to merge base and multiple delta structured streams in Azure DataLake

Question

I have a base stream and I have multiple delta streams for different consecutive dates. I want to merge them to get the final stream. How can I accomplish this in Azure Datalake. For example assume these are the stream. I need to merge these streams to get the final stream. Merge would replace the original value with the new value. The number of delta stream as of now is more than 100.

Base Stream:

1022918 300.00  300.00  2   7   5   100
1022918 400.00  400.00  2   170 5   100
1022919 1000.00 1000.00 2   7   6   100
1022920 2000.00 2000.00 2   170 6   100
1022921 3000.00 3000.00 2   123 7   100
1022922 100.00  100.00  2   162 7   100
1022922 200.00  200.00  2   123 9   100
1022922 300.00  300.00  2   162 9   100

Delta Stream 1:

1022918 400.00  300.00  2   7   5   100
1022919 2000.00 1000.00 2   7   6   100
1022920 3000.00 2000.00 2   170 6   100
1022922 400.00  300.00  2   162 9   100

Delta Stream 2:

1022919 2500.00 1000.00 2   7   6   100
1022920 3500.00 2000.00 2   170 6   100

Expected Output

1022918 400.00  300.00  2   7   5   100
1022918 400.00  400.00  2   170 5   100
1022919 2500.00 1000.00 2   7   6   100
1022920 3500.00 2000.00 2   170 6   100
1022921 3000.00 3000.00 2   123 7   100
1022922 100.00  100.00  2   162 7   100
1022922 200.00  200.00  2   123 9   100
1022922 400.00  300.00  2   162 9   100

What are your expected results from these files? What is the unique key for the file? — wBob, Apr 10 '17 at 23:08
Updated the question with expected output. The primary key would be 1st Cloumn + 5th Column + 6th Cloumn. — user2580488, Apr 11 '17 at 05:52

score 2 · Answer 1 · answered Apr 10 '17 at 16:17

Azure Data Lake Store is an append-only file system. What that means is records can only be added to the end of a file. This is not much unlike many other append-only systems. Various analytics applications such as Azure Data Lake Analytics and Hive can be used to logically merge these base and delta streams

There are typically 4 steps in achieving this.

Load base data - This is straightforward. When loading base data you need to add an identifier. For example base date or base version. For this discussion lets say you use version number. Say base data is version 0
Load delta data in their own table/file - When loading them you also need to have an identifier that can be used to compare with the base data to determine the latest records. Say we also have version number. So new records will be a larger version number 1, 2, 3 and so forth.
Build a merge view - This is a query that joins base data with delta data on ID where version is the greatest. Now when you get this view, this is all the latest records.
Create new base data with records from the merge view.

Here is an article that explains how you can use Hive to achieve this. This one uses datetime to identify the latest records.

https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/

score 1 · Accepted Answer · answered Apr 11 '17 at 10:29

I have worked up an example in U-SQL using your test data. It takes the approach as described above, and gets the last record based on the unique key as described by you and a some kind of data component.

I used these files:

There are a couple of assumptions:

files are space-delimited
filename (or could be folder) includes a date component

Code:

// Assumptions: 
//      col1, 5 and 6 is the intented unique key
//      each file includes a date in the filename

DECLARE @baseFilesLocation string = "input/base_{filedate}.txt";
DECLARE @deltaFilesLocation string = "input/delta_{filedate}.txt";


// Get the base files
@baseFiles =
    EXTRACT col1 int,
            col2 decimal,
            col3 decimal,
            col4 int,
            col5 int,
            col6 int,
            col7 int,
            filedate string

    FROM @baseFilesLocation
    USING Extractors.Text(delimiter : ' ');


// Get the delta files
@deltaFiles =
    EXTRACT col1 int,
            col2 decimal,
            col3 decimal,
            col4 int,
            col5 int,
            col6 int,
            col7 int,
            filedate string

    FROM @deltaFilesLocation
    USING Extractors.Text(delimiter : ' ', silent: true);


@working = 
    SELECT *
    FROM @baseFiles
    UNION ALL
    SELECT *
    FROM @deltaFiles;


// Work out the (col1, 5 and 6) and max filedate combination
@maxDates = 
    SELECT col1, col5, col6, MAX(filedate) AS filedate
    FROM @working
    GROUP BY col1, col5, col6;


// Join the original set (all base data, all delta files) to the max dates rowset, to get last record for each col1
@output = SELECT w.*
    FROM @working AS w
        SEMIJOIN @maxDates AS d ON w.col1 == d.col1
            AND w.col5 == d.col5
            AND w.col6 == d.col6
            AND w.filedate == d.filedate;


OUTPUT @output
    TO "/output/output.csv"
ORDER BY col1, col2, col3
USING Outputters.Csv();

My results (which match your expected results:

HTH

Hey, just tried your solution. How is filedate being set. I wanted that to be a control parameter so that I can specify the startdate for the delta files and all the delta files from startdate to currentDate are merged to base file. — user2580488, Apr 13 '17 at 12:49
As mentioned in my assumptions, this solution assumes that a date is incorporated either in the filename or the folder. If you look at my sample files, you will see they have a date in the filename, eg delta_20170103.txt. This is what's known as a [virtual column](https://msdn.microsoft.com/en-us/library/azure/mt621320.aspx) in U-SQL. — wBob, Apr 13 '17 at 13:55

How to merge base and multiple delta structured streams in Azure DataLake

2 Answers2