
I need a unique GUID for every row I'm transforming from the source.
Below is a sample script; Guid.NewGuid() always returns the same value for all rows.

@Person =
    EXTRACT SourceId          int,
            AreaCode          string,
            AreaDetail         string,
            City        string
    FROM "/Staging/Person"
    USING Extractors.Tsv(nullEscape:"#NULL#");

@rs1 =
    SELECT 
    Guid.NewGuid() AS PersonId,
    AreaCode,
    AreaDetail,
    City    
    FROM @Person;

OUTPUT @rs1   
    TO "/Datamart/DimUser.tsv"
      USING Outputters.Tsv(quoting:false, dateTimeFormat:null);
Blorgbeard
Pravin Dingore

3 Answers


Please note that U-SQL is a declarative language and as such will snapshot known non-deterministic functions such as Guid.NewGuid() or DateTime.Now to one value per script.

While you can work around that by wrapping such functions into a C# function, this practice is highly discouraged, since you are making the script non-deterministic, which can lead to script failures if a node in the execution has to be retried and does not produce a repeatable result!

So how can you provide a unique number?

The options are:

  1. Add the value already in the external data if you can change the data generation.
  2. Skolemization: Write a deterministic expression that combines key attributes into a unique value.
  3. Use ROW_NUMBER() OVER () on the data that you read. If you already have data that you need to guarantee uniqueness against, either add the time ticks of the time the job is run, or get the highest existing value, or get a large enough interval bump, depending on your requirements.

Here is a sample that uses the time ticks plus ROW_NUMBER() to make sure that the id is unique for each row every time you run the script, since, as mentioned above, U-SQL will evaluate DateTime.Now once per script invocation:

@data =
SELECT *
FROM (VALUES
      ( "John", "Doe" ),
      ( "Paul", "Miller" ),
      ( "Tracy", "Smith" ),
      ( "Jane", "Doe")
     ) AS T(firstname, lastname);

@res = 
SELECT DateTime.Now.Ticks+ROW_NUMBER() OVER () AS id, 
       firstname, lastname
FROM @data;

OUTPUT @res
TO "/output/data.csv"
USING Outputters.Csv();
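
Option 2 (Skolemization) could be sketched against the question's @Person rowset roughly as follows. This is a minimal sketch, not a tested job: it assumes SourceId alone uniquely identifies a source row (if it does not, concatenate further key columns), and it derives the GUID from an MD5 hash of that key using standard .NET calls. The "Person|" prefix is a hypothetical namespace tag, and hashing is only one possible choice of deterministic expression.

```usql
// Assumption: SourceId is unique per source row.
// The same SourceId always hashes to the same 16 bytes, so a retried
// vertex or a rerun of the script produces the same PersonId.
@rs1 =
    SELECT new Guid(
               System.Security.Cryptography.MD5.Create().ComputeHash(
                   System.Text.Encoding.UTF8.GetBytes("Person|" + SourceId.ToString()))) AS PersonId,
           AreaCode,
           AreaDetail,
           City
    FROM @Person;
```

Because the value depends only on the input row, it stays deterministic, unlike a wrapped Guid.NewGuid() call.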
Michael Rys

A quick summary of the issue is that you shouldn't attempt to assign unique values through techniques that rely on generating new Guids or on any other methods that are "time-based". The reason is that rows in U-SQL may be recalculated due to vertex retries, performance optimizations, etc.

In those cases, the rows will be assigned new values, which will eventually lead to an error while running the U-SQL script, because U-SQL requires that rows are deterministic with respect to input data.

Instead of assigning a new Guid, use the ROW_NUMBER window function, which can safely add new unique numbers to rows.

@result =
    SELECT 
        *,
        ROW_NUMBER() OVER () AS UID
    FROM @querylog;
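
To keep ROW_NUMBER() values from colliding with ids that already exist in the target, the numbering can be offset by the highest existing value. The following is a sketch under stated assumptions, not a tested job: it reuses the question's @Person rowset, assumes a previous run already wrote "/Datamart/DimUser.tsv" with a numeric PersonId as its first column, and assumes that file is not empty. The rowset names @existing, @max and @numbered are hypothetical.

```usql
// Assumption: rows written by an earlier run, with PersonId stored as a number.
@existing =
    EXTRACT PersonId   long,
            AreaCode   string,
            AreaDetail string,
            City       string
    FROM "/Datamart/DimUser.tsv"
    USING Extractors.Tsv();

// Single-row rowset holding the highest id used so far.
@max =
    SELECT MAX(PersonId) AS MaxId
    FROM @existing;

// Number the new rows 1..N.
@numbered =
    SELECT ROW_NUMBER() OVER () AS RowNum,
           AreaCode,
           AreaDetail,
           City
    FROM @Person;

// Shift the new numbers past the existing maximum.
@rs1 =
    SELECT m.MaxId + n.RowNum AS PersonId,
           n.AreaCode,
           n.AreaDetail,
           n.City
    FROM @numbered AS n
         CROSS JOIN @max AS m;
```

The result should be written to a new file or appended through a table rather than overwriting the file being read in the same script.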
saveenr

Create a udf in the code-behind:

using System;

namespace USQL_Namespace
{
    public static class Udfs
    {
        // Returns a new GUID string on every call (non-deterministic).
        public static string newGuidString()
        {
            return Guid.NewGuid().ToString();
        }
    }
}

and reference it inline:

@o = 
    SELECT USQL_Namespace.Udfs.newGuidString() AS newId;
  • Michael's answer specifically says to not do this - _this practice is highly discouraged, since you are making the script non-deterministic_. Incidentally, **script failure** doesn't just refer to actual runtime failure - which at least alerts you that something is wrong - it also includes potentially incorrect output data without any warning: an invisible error (the worst kind of error!). More discussion at https://stackoverflow.com/questions/43934060/adla-job-is-not-producing-expected-results/44011762 – Nabeel May 22 '17 at 21:44