
The code runs very fast over 2000+ small files (~10-50 KB each): about 1 minute with parallelism = 5.

@arenaData =
    EXTRACT col1, col2, col3
    FROM @in
    USING Extractors.Tsv(quoting : true, skipFirstNRows : 1, nullEscape : "\\N", encoding:Encoding.UTF8);

@res =
    SELECT col1, col2, col3
    FROM @arenaData;

OUTPUT @res
TO @out
USING Outputters.Csv();

But if I change the code like this, it takes ~1 hour:

@arenaData =
    EXTRACT col1, col2, col3
    FROM @in
    USING Extractors.Tsv(quoting : true, skipFirstNRows : 1, nullEscape : "\\N", encoding:Encoding.UTF8);

@res =
    SELECT col1.ToUniversalTime().ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture) AS col1_converted,
           col2,
           col3
    FROM @arenaData;

OUTPUT @res
TO @out
USING Outputters.Csv();

Why is the .NET call so slow? I need to convert the date format in the source CSV files to "yyyy-MM-dd HH:mm:ss". How can I do that efficiently?

  • This does not sound right. There is some additional overhead from having to load the CLR and to call from the native code into the C# execution, but that should not be 60 times worse. Could you please send me the job links (usql at Microsoft dot com) so I can ask our engineering team to investigate? – Michael Rys Feb 21 '17 at 15:43
  • I started a support ticket with the ADLA support team before I posted the question to Stack Overflow. I ran these tests for the support team (job URLs below). Without CLR, ~2 min (MAXDOP = 5): https://arkadium.azuredatalakeanalytics.net/jobs/68d7a42a-4f66-4308-a398-3775eee74877?api-version=2015-11-01-preview The same with one CLR call, ~38 min (MAXDOP = 5): https://arkadium.azuredatalakeanalytics.net/jobs/4291a7e6-ed0f-4516-b677-38a432a9997c?api-version=2015-11-01-preview The timings changed because the parameters changed, but the problem still exists. Such a big difference. – churupaha Feb 22 '17 at 10:08
  • Some more tests: the same job with CLR + parallelism increased from 5 to 20, elapsed time ~10 min: https://arkadium.azuredatalakeanalytics.net/jobs/c09a8917-3425-48df-97ea-e4a84dad3c15?api-version=2015-11-01-preview The same job with CLR + parallelism decreased from 5 to 3, elapsed time ~59 min, then canceled by me: https://arkadium.azuredatalakeanalytics.net/jobs/9168ea66-e988-4497-b661-417f1128ceac?api-version=2015-11-01-preview – churupaha Feb 22 '17 at 10:14
  • I got the root causes from the engineering team. I am currently travelling but will be answering later tonight/tomorrow. – Michael Rys Feb 22 '17 at 12:54
  • Thank you, I will wait for your comment. – churupaha Feb 22 '17 at 13:29
  • @MichaelRys Looks like I understand what is going on. We have 2880 small files, and for each of them we get 1 vertex, so in total we have 2880 vertices. I have read a bit about the ADLA execution model, and it looks like when an Analytics Unit (AU) migrates from one vertex to another it reinitializes the CLR COM server. So in our case the CLR is reinitialized 2880 times. The solution for us is to merge the files before starting the processing. Am I right? I have tested these assumptions and they look right... I want to hear your explanation. – churupaha Feb 22 '17 at 16:20
  • @MichaelRys Here are two job executions after the fix. Without CLR, 55 sec: https://arkadium.azuredatalakeanalytics.net/jobs/e7e49e98-5827-4acd-8e98-93b6beed336f?api-version=2015-11-01-preview With the CLR call, ~65 sec: https://arkadium.azuredatalakeanalytics.net/jobs/22bf1504-2a5b-473a-bc43-aabbffafd763?api-version=2015-11-01-preview I'm loving it! – churupaha Feb 22 '17 at 16:23
  • See my answer, but you are basically right. – Michael Rys Feb 24 '17 at 11:34

1 Answer


Great to hear that you are getting better performance now!

Your job runs over 2800 very small files using an expression that is executed in managed code rather than being translated into C++, as some of the more common C# expressions in U-SQL are.

This leads to the following problem:

  1. You start your job with a certain number of AUs. Each AU then starts a YARN container to execute part of your job. This means the container needs to be initialized cleanly, which takes some time (you can see it in the Vertex Execution View as creation time). That initialization cost is not much overhead if your vertex does a large amount of processing. Unfortunately, in your case the processing of each small file is very quick, so the overhead is large relative to the work done.

  2. If the vertex only executes system-generated code that we codegen into C++, then we can reuse containers without paying the re-initialization cost. Unfortunately, we cannot reuse containers for general user code that gets executed with the managed runtime, due to potential artifacts being left behind. So in that case we need to re-initialize the containers, which takes time (over 2800 times in your job).

Now, based on your feedback, we are improving our re-initialization logic (so that containers can still be reused if you do not do anything fancy with inline C# expressions). It will also get better once we can process several small files inside a single vertex instead of one file per vertex.

Workarounds for you are to increase the size of your files and, where possible (it is not always possible, of course), to avoid having custom code in so many of the vertices.
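To make that workaround concrete, here is a minimal U-SQL sketch of the file-merge approach, assuming the small inputs are TSVs with the same three columns and that col1 is a DateTime. The paths, the /input/{*}.tsv file-set pattern, and the column types are illustrative placeholders, not taken from the original job.

// ----- Script 1: compact the many small TSVs into one larger file. -----
// No inline C# here, so the vertices stay in code-generated native code
// and containers can be reused.
DECLARE @in_set string = "/input/{*}.tsv";      // hypothetical file-set pattern
DECLARE @merged string = "/staging/merged.tsv"; // hypothetical staging path

@small =
    EXTRACT col1 DateTime, col2 string, col3 string
    FROM @in_set
    USING Extractors.Tsv(quoting : true, skipFirstNRows : 1, nullEscape : "\\N", encoding : Encoding.UTF8);

OUTPUT @small
TO @merged
USING Outputters.Tsv();

// ----- Script 2 (submit after Script 1 completes): apply the managed-code -----
// date conversion over the merged file, so the CLR is initialized for only
// a handful of vertices instead of one per small file.
@merged_rows =
    EXTRACT col1 DateTime, col2 string, col3 string
    FROM "/staging/merged.tsv"
    USING Extractors.Tsv();

@res =
    SELECT col1.ToUniversalTime().ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture) AS col1_converted,
           col2,
           col3
    FROM @merged_rows;

OUTPUT @res
TO "/output/converted.csv"
USING Outputters.Csv();

With this split, only the vertices of the second, much smaller job pay the managed-runtime initialization cost, which matches the 55 sec vs. ~65 sec timings reported in the comments after the files were merged.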
