Is my application running efficiently?

Question

The question is generic and can be extended to other frameworks or contexts beyond Spark & Machine Learning algorithms.

Leaving the details aside, at a high level the code runs on a large dataset of labeled text documents. It goes through 9 iterations of cross-validation to tune some parameters of a multi-class Logistic Regression classifier.

It is expected that this kind of Machine Learning processing will be expensive in terms of time and resources.

I am running the code now and everything seems to be OK, except that I have no idea whether my application is running efficiently.

I couldn't find guidelines saying that, for a given type and amount of data and for a given type of processing and computing resources, the processing time should be approximately on the order of...

Is there any method that helps in judging whether my application is running slow or fast, or is it purely a matter of experience?

score 1 · Accepted Answer · answered Jan 19 '16 at 09:30

I had the same question and I didn't find a real answer/tool/way to test how good my performance was by looking "only inside" my application.

I mean, as far as I know, there's no tool like a speed test for an internet connection :-)

The only way I found is to rewrite my app (if possible) with another stack to see whether the difference (in terms of time) is THAT big.
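
For that kind of comparison, a small timing harness is usually enough. Below is a minimal sketch in Python, assuming each implementation can be wrapped in a zero-argument callable that runs the whole job. The names `time_runs` and `slowdown` are hypothetical helpers made up for this illustration, not part of Spark or any library.

```python
import statistics
import time


def time_runs(job, repeats=3):
    """Run `job` (a zero-argument callable) `repeats` times and
    return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def slowdown(candidate_seconds, baseline_seconds):
    """Return how many times slower the candidate is than the baseline."""
    return candidate_seconds / baseline_seconds
```

Taking the median over a few runs rather than a single measurement reduces the effect of a cold cache or a busy cluster on the comparison.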

Otherwise, I found 2 main resources very interesting, even if they are quite old:

1) A sort of 4-point guide to remember when coding:

Understanding the Performance of Spark Applications, Spark Summit 2013

2) A 2-part article from the Cloudera blog on tuning your jobs: episode1 episode2

Hope it helps

FF

score 1 · Answer 2 · answered Jan 19 '16 at 09:32

Your question is pretty generic, so I would also highlight a few generic areas where you can look for performance optimizations:

Scheduling Delays - Are there significant delays in scheduling the tasks? If yes, you can analyze the reasons (maybe your cluster needs more resources, etc.).
Utilization of Cluster - Are your jobs utilizing the available cluster resources (like CPU and memory)? If not, again look for the reasons. Maybe creating more partitions would speed up execution. Maybe significant time is spent in serialization, in which case you can switch to Kryo serialization.
JVM Tuning - Consider analyzing GC logs and tune if you find anomalies.
Executor Configurations - Analyze the memory/cores provided to your executors. They should be sufficient to hold the data processed by the task/job. your DAG and
Driver Configuration - As with executors, the driver should also have enough memory to hold the results of certain functions like collect().
Shuffling - See how much time is spent in shuffling and what kind of data locality your tasks get.
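
Several of the knobs above map onto Spark configuration properties. As a rough sketch, the Python snippet below collects them in one place and turns them into `--conf` arguments for `spark-submit`. The property names are real Spark settings, but every value is a placeholder assumption to adapt to your cluster. The GC logging flags are the ones the Spark tuning guide suggests for Java 8 era JVMs, and `to_submit_args` is a hypothetical helper written for this example.

```python
# Real Spark property names; all values are placeholder assumptions.
TUNING_CONF = {
    # Switch from Java serialization to Kryo.
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # More partitions can mean more parallel tasks for RDD operations.
    "spark.default.parallelism": "200",
    # Executor and driver sizing.
    "spark.executor.memory": "4g",
    "spark.executor.cores": "4",
    "spark.driver.memory": "2g",
    # Emit GC logs on the executors so they can be analyzed later.
    "spark.executor.extraJavaOptions":
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
}


def to_submit_args(conf):
    """Turn a {property: value} dict into a flat list of
    spark-submit arguments: ["--conf", "key=value", ...]."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args
```

The resulting list can be passed to `subprocess.run(["spark-submit", *to_submit_args(TUNING_CONF), "my_app.py"])`, where `my_app.py` stands in for your own application.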

All of the above are needed for a preliminary investigation, and in some cases they can also improve the performance of your jobs to an extent, but there can be complex issues whose solution depends on the specific case.
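
For such a preliminary investigation, it can help to express each overhead as a share of total task time. The sketch below assumes you have already copied per-task timings (for example from the Spark UI) into plain dictionaries. The keys `run_ms`, `gc_ms`, `shuffle_wait_ms` and `scheduler_delay_ms` are hypothetical names chosen for this example, not a Spark API. It also assumes that GC and shuffle wait happen inside the run time, while scheduler delay is counted on top of it.

```python
def time_breakdown(tasks):
    """Return GC, shuffle wait and scheduler delay as fractions of the
    total task time (run time plus scheduler delay).

    `tasks` is a list of dicts with millisecond values under the
    hypothetical keys 'run_ms', 'gc_ms', 'shuffle_wait_ms' and
    'scheduler_delay_ms'.
    """
    total = sum(t["run_ms"] + t["scheduler_delay_ms"] for t in tasks)
    if total == 0:
        return {"gc": 0.0, "shuffle_wait": 0.0, "scheduler_delay": 0.0}
    return {
        "gc": sum(t["gc_ms"] for t in tasks) / total,
        "shuffle_wait": sum(t["shuffle_wait_ms"] for t in tasks) / total,
        "scheduler_delay": sum(t["scheduler_delay_ms"] for t in tasks) / total,
    }
```

A large share in any one bucket points to the matching item in the list above as the first place to look.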

Please also see the Spark Tuning Guide
