I have a U-SQL job that extracts data from two .tsv and two .csv files, selects some features, and performs some simple transformations before outputting to CSV/TSV files in ADL.

However, when I add further transformations within the SELECT statements, the job takes considerably longer to run (10+ minutes vs. 1 minute), and one SELECT statement in particular is responsible.

I believe the cause is the calculation of the 'YearMonth' column, where I have essentially used concatenation to get the date column into the format I need.

Below is the job that runs quickly:

    @StgCrime =
        SELECT CrimeID,
               [Month],
               ReportedBy,
               FallsWithin,
               Longitude,
               Latitude,
               Location,
               LSOACode,
               LSOAName,
               CrimeType,
               LastOutcome,
               Context
        FROM @ExtCrime;

    OUTPUT @StgCrime
    TO "CrimeOutput/Crimes.csv"
    USING Outputters.Csv(outputHeader:true);

And the job that takes a lot longer:

    @StgCrime =
        SELECT CrimeID,
               String.Concat([Month].Substring(0, 4), [Month].Substring(5, 2)) AS YearMonth,
               ReportedBy AS ForceName,
               Longitude,
               Latitude,
               Location,
               LSOACode,
               CrimeType,
               LastOutcome
        FROM @ExtCrime;

    OUTPUT @StgCrime
    TO @OCrime
    USING Outputters.Csv(outputHeader:true);
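
For reference, if [Month] always arrives in yyyy-MM form, the same YearMonth value could be produced with a single call; this is an untested sketch of that alternative, not the job as run:

    @StgCrime =
        SELECT CrimeID,
               [Month].Replace("-", "") AS YearMonth, // e.g. "2018-04" -> "201804"; assumes yyyy-MM input
               ReportedBy AS ForceName,
               Longitude,
               Latitude,
               Location,
               LSOACode,
               CrimeType,
               LastOutcome
        FROM @ExtCrime;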

The difference in the Vertex view:

[Vertex view screenshot: simple/quick job]

[Vertex view screenshot: with the additional transformation]

Can anyone help clarify this for me? Surely that one transformation shouldn't cause such an increase in job run time?

The data set being queried is made up of 1,066 CSV files, around 2.5 GB in total.

Matt Lakin

  • Most people here want formatted text, not images (or links to images). – jarlh Apr 17 '18 at 09:49
  • @jarlh thanks for the tip - have fixed now. You can tell this is my first post! – Matt Lakin Apr 17 '18 at 09:52
  • Looks much better now! – jarlh Apr 17 '18 at 09:54
  • It may make sense. Check how much longer `String.Concat` takes locally in a C# application (see the sketch after these comments). Without `String.Concat`, all columns just pass through, and there are many ways to optimize a pass-through. – qxg Apr 18 '18 at 11:36
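
A minimal local timing sketch along the lines of qxg's suggestion; the sample value and iteration count are illustrative assumptions, not from the original post:

    // Rough local timing of the expression used in the slow SELECT.
    using System;
    using System.Diagnostics;

    class ConcatTiming
    {
        static void Main()
        {
            const string month = "2018-04"; // sample value in the same yyyy-MM shape as [Month]
            var yearMonth = "";
            var sw = Stopwatch.StartNew();
            for (var i = 0; i < 10000000; i++)
            {
                yearMonth = String.Concat(month.Substring(0, 4), month.Substring(5, 2));
            }
            sw.Stop();
            Console.WriteLine($"{yearMonth}: 10M iterations in {sw.ElapsedMilliseconds} ms");
        }
    }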

1 Answer

Without seeing the full script, the generated job graph, and the number of AUs specified, it is hard to estimate why one job ran so much slower than the other.

You say that the "data file" is made up of 1,066 CSV files, all of which seem rather small at around 2.5 GB in total. I would expect that you get about 1,066 extract vertices in the extract stage. Is that the same for the simple job as well?

We have a new feature in preview that groups up to 200 files (or 1 GB, whichever comes first) into a single vertex to minimize vertex startup time.

Can you try your job with the following statement added:

    SET @@FeaturePreviews = "InputFileGrouping:on";
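
For context, a minimal sketch of where the statement would sit, at the top of the script before the EXTRACT it affects; the extract schema and path below are placeholders, not the original job's:

    // Preview feature: group many small input files into fewer extract vertices.
    SET @@FeaturePreviews = "InputFileGrouping:on";

    // Placeholder extract; the column list and path are illustrative only.
    @ExtCrime =
        EXTRACT CrimeID string,
                [Month] string,
                ReportedBy string
        FROM "/CrimeInput/{*}.csv" // file set pattern matching the many small CSV files
        USING Extractors.Csv(skipFirstNRows:1);
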
Michael Rys