I have a U-SQL job that extracts data from 2 .tsv and 2 .csv files, selects some features and performs some simple transformations before outputting to csv/tsv files in ADL.
However, when I attempt to add further transformations within SELECT statements, it seems the job takes considerably longer to run (10+ mins vs 1 min), due to one SELECT statement in particular.
I believe it is due to the calculation of the 'YearMonth' column, where I have essentially used concatenation to get the date column to the format I need it in.
Below is the job that runs quickly:
@StgCrime =
SELECT CrimeID,
[Month],
ReportedBy,
FallsWithin,
Longitude,
Latitude,
Location,
LSOACode,
LSOAName,
CrimeType,
LastOutcome,
Context
FROM @ExtCrime;
OUTPUT @StgCrime
TO "CrimeOutput/Crimes.csv"
USING Outputters.Csv(outputHeader:true);
And the job that takes a lot longer:
@StgCrime =
SELECT CrimeID,
String.Concat([Month].Substring(0, 4),[Month].Substring(5, 2)) AS YearMonth,
ReportedBy AS ForceName,
Longitude,
Latitude,
Location,
LSOACode,
CrimeType,
LastOutcome
FROM @ExtCrime;
OUTPUT @StgCrime
TO @OCrime
USING Outputters.Csv(outputHeader:true);
The difference in Vertex view:
With additional transformation
Can anyone help clarify this for me? Surely that one transformation shouldn't cause such an increase in job run time?
The data file being queried is made up of 1,066 csv files, around 2.5GB in total.