
Scenario:

We have "employees_final" ADLA catalog/table.

Users from various locations would be loading employees data into "employees_final" ADLA catalog/table "in parallel" through ADF invoking U-SQL.

When users load data, it first lands in a staging table (employees_staging). If an existing record is updated, the staging table ends up holding two versions of that record. The next step is to build the "employees_final" table so that it contains ONLY the latest version of each record.

To create the final table, we JOIN the staging and final tables to identify the inserted/updated employees, combine the unchanged existing records with the new and updated ones, and RECREATE the final table.

NOTE: Since U-SQL has no DELETE option, we cache the existing records and append the new/updated records.
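A minimal sketch of the combine-and-recreate step described above, assuming a hypothetical schema (EmpId as the key, ModifiedDate as the version marker) and illustrative database/table names:

```sql
// Assumptions: EmployeesDb.dbo.employees_staging and .employees_final
// share the schema (EmpId int, Name string, ModifiedDate DateTime).

// 1. Keep only the latest version of each employee from staging.
@latest_staged =
    SELECT EmpId, Name, ModifiedDate
    FROM
    (
        SELECT EmpId, Name, ModifiedDate,
               ROW_NUMBER() OVER (PARTITION BY EmpId
                                  ORDER BY ModifiedDate DESC) AS rn
        FROM EmployeesDb.dbo.employees_staging
    ) AS versions
    WHERE rn == 1;

// 2. Keep the existing final records NOT touched by this load.
@unchanged =
    SELECT f.EmpId, f.Name, f.ModifiedDate
    FROM EmployeesDb.dbo.employees_final AS f
         LEFT ANTISEMIJOIN @latest_staged AS s
         ON f.EmpId == s.EmpId;

@merged =
    SELECT * FROM @unchanged
    UNION ALL
    SELECT * FROM @latest_staged;

// 3. Recreate the final table: with no DELETE/UPDATE, the whole table
//    is truncated and re-inserted. If the compiler rejects reading and
//    writing employees_final in one script, split this into two jobs.
TRUNCATE TABLE EmployeesDb.dbo.employees_final;

INSERT INTO EmployeesDb.dbo.employees_final
SELECT EmpId, Name, ModifiedDate FROM @merged;
```

Two such jobs running concurrently race on the TRUNCATE, which is the data-loss risk described below.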

The drawback of this approach is that when users run ADF pipelines in parallel, each run tries to rewrite the SAME employees_final table, and the TRUNCATE/RECREATE approach risks DATA LOSS.

Is there a better way to handle this PARALLEL scenario?

Michael Rys
Sha

1 Answer


I am not sure I completely understand your question, but first some general comments on concurrent updates in ADLA/U-SQL (and many other non-transactional big data systems):

  1. ADLA/U-SQL aims to provide behavior equivalent to read-committed snapshot isolation (RCSI), which allows multiple readers and one concurrent writer. There is currently a store-level limitation that can lead to job failures even in this case, but you should not see data loss.

  2. Multiple destructive writers (e.g., truncation) on the same object will normally cause one of the writers to fail (currently it is the second writer).

  3. You should be able to run multiple appends on the same object in parallel (e.g., multiple INSERTs into the same table). Note though: do not perform single-row insertions, since each insertion creates a table fragment file, which causes severe performance and scale problems.
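For example, each parallel load could append its whole batch to the staging table in a single set-based INSERT (the file path and schema below are hypothetical):

```sql
// Hypothetical delta file produced by one ADF run.
@delta =
    EXTRACT EmpId int,
            Name string,
            ModifiedDate DateTime
    FROM "/input/employees_delta.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// One set-based INSERT per job: the whole batch becomes a single
// table fragment, instead of one fragment per row.
INSERT INTO EmployeesDb.dbo.employees_staging
SELECT EmpId, Name, ModifiedDate
FROM @delta;
```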

So given these points, I do not see a problem if you have a staging table into which you insert the deltas, but you will need a single job to merge them back into the main table, especially if you want to recreate it to handle updates and deletes and to keep it from becoming over-fragmented.

Michael Rys