1

I am processing data in U-SQL but not getting expected results. Here is what I am doing:

1- Select data from ADL table partitions and assign it to @data1

2- Aggregate data using Group BY and assign it to @data2

3- Truncate partitions

4- Insert data(produced in step 2) into the same table

5- Use @data2 and generate a unique GUID for every record using user
defined function and assign it to @data2
        //UDF Code
        public static Guid GetNewGuid ()
        {
        return Guid.NewGuid ();
        }

6- Select few columns from @data2 and assign it to @data3

Strangely GUIDs in @data2 and @data3 are totally different.

If I perform some joins with other datasets and change schema in Step 5 and then generate unique GUIDs then I get same GUIDS at last step. It looks like some script optimization is happening in the backend that is creating this problem.

Could you please let me know what is wrong happening in above workflow? Or if some sort of optimization is happening in the backend then how to learn how script optimization works.

Update: In this question, my focus is to learn why something calculated on one step is automatically changed in next step.

Jamil
  • 858
  • 11
  • 26
  • See [here](http://stackoverflow.com/questions/38964160/guid-newguid-always-return-same-guid-for-all-rows) for a reply from Azure Data Lake Team. Guid is not something you should be using with U-SQL to uniquely identify records because they might be "recalculated - to due vertex retries, performance optimizations, etc." – wBob May 12 '17 at 12:39
  • Possible duplicate of [Guid.NewGuid() always return same Guid for all rows](http://stackoverflow.com/questions/38964160/guid-newguid-always-return-same-guid-for-all-rows) – wBob May 12 '17 at 12:39
  • Thanks for this link. I will consider this for unique ID generation but for this question I have a different focus that I updated above. – Jamil May 16 '17 at 09:18

2 Answers2

2

Answering the updated focus - why something calculated on one step is automatically changed in the next step

wBob's excerpt answers the question completely IMO, but maybe a broader context will help.

Semantically, the answer is simply

  • Because you've violated a usage requirement of the language (determinism), the resulting behavior is undefined.

Undefined means anything can happen, so you cannot have expectations - of consistent values or otherwise. Any discussion of the actual behavior seen (different Guids) is implementation detail.

Semantic deep dive
Steps are an illusion.

  • Statements in an imperative language like C# are a sequence of exact instructions (how).

  • Statements in a functional language like U-Sql are a list of requirements, describing outputs in terms of inputs (what). The U-Sql optimizer has complete implementation flexibility in meeting those requirements. Rowsets are logical constructs to organize user requirements, they need not actually exist at implementation time; while the logic corresponding to a rowset may be split, merged, skipped, repeated, etc in implementation.

So, for example, a perfectly legal implementation of steps 5, 6 under the deterministic requirement:

@data3 = SELECT <FewCols>, GetNewGuid() AS NewGuid FROM @data2;
@data2 = SELECT *, GetNewGuid() AS NewGuid FROM @data2;
Nabeel
  • 114
  • 2
  • Could you please explain how and what is violated? I am referring to this: "Because you've violated a usage requirement of the language (determinism), the resulting behavior is undefined." – Jamil May 22 '17 at 06:26
  • **Rule**: Non-_deterministic_ logic is not allowed in U-Sql. _**Deterministic**_: same _inputs_ => same _outputs_ (i.e., repeatable) _**Inputs**_ - For UDFs, function arguments if any. _**Outputs**_ - For UDFs, function output. Functions like Guid.NewGuid don't satisfy this requirement (no input, different outputs on rerun) - so their behavior is undefined. There's some nuance (per @Michael's response - though I'm not sure whether snapshotting is a language guarantee or convenient current implementation), but the best mental model to start with: do not use non-deterministic runtime code. – Nabeel May 22 '17 at 20:41
0

Welcome to the lake Nabeel. Looking forward to more of your answers!

In addition, the behavior with non-deterministic functions is a bit complicated by the fact that for some functions, we make them deterministic (e.g., DateTime.Now()) if they appear in U-SQL but outside of a C# code block (e.g., a user-defined function), but they stay non-deterministic inside the C# code block.

The recommended way to create keys is a technique called Skolemization, where you use a deterministic key generation based on the identifying properties. That way you stay deterministic.

Michael Rys
  • 6,684
  • 15
  • 23
  • This is complicated. :) I am confused because if I alter or remove few steps from my workflow, then it works perfectly fine. For example, if I remove step 3 and 4, then it works fine. If I make some join with some other data sets before generating IDs, then it works fine. I am trying to learn how things work in the lake. – Jamil May 22 '17 at 06:31