Perhaps nothing new, but we have a use case to profile a Spark application that runs in a distributed fashion.
We currently use async-profiler to monitor each executor (a separate process in Spark), which generates one JFR per process. It's tedious to look at the individual executor profiles, make sense of them, and compare them.
We use the JDK's jfr assemble to combine all the JFRs produced. Curious: is this how distributed profiling is usually done?
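In case it helps, this is roughly what our assembly step looks like; the paths and the host-list file are placeholders for our actual setup:

# Pull each executor's JFR files into one directory, then assemble.
# executor-hosts.txt and the paths below are placeholders.
mkdir -p /tmp/jfr-combined
for host in $(cat executor-hosts.txt); do
  scp "${host}:/tmp/profiles/*.jfr" /tmp/jfr-combined/
done
# jfr ships with the JDK (12+); assemble concatenates the chunk files
# in a directory into a single recording.
jfr assemble /tmp/jfr-combined combined.jfr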
/async-profiler/profiler.sh collect -e cpu -d 120 -i 20ms -o jfr -f ${file} ${pid}
This runs every 120 seconds, which effectively gives us continuous profiling; a rough sketch of the wrapper is below.
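Concretely, the loop looks something like this (the output path is a placeholder, and ${pid} is the executor PID we already have; collect blocks for the -d duration, so the loop paces itself):

# Hypothetical wrapper: one 120s collection per iteration,
# with a timestamped output file per executor PID.
while true; do
  ts=$(date +%Y%m%dT%H%M%S)
  /async-profiler/profiler.sh collect -e cpu -d 120 -i 20ms -o jfr \
      -f "/tmp/profiles/executor-${pid}-${ts}.jfr" "${pid}"
done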
The benchmark we are running compares the same job on a cluster of EC2 2xlarge instances vs. a cluster of 4xlarge instances, and we're noticing that our jobs run slower on 4xlarge. The 2xlarge cluster has twice as many machines as the 4xlarge cluster, so the total number of executor processes (and cores) is the same in both setups.
Each process uses 8 cores and a 54 GB heap. On 2xlarge, each machine runs a single process; on 4xlarge, we run two processes per machine without any isolation.
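For context, that executor sizing corresponds roughly to the following spark-submit flags; the class name, jar, and instance count are placeholders, and cluster-specific settings are omitted:

# Sketch of the executor sizing described above, using standard
# spark-submit options. Names below are hypothetical.
spark-submit \
  --class com.example.OurJob \
  --executor-cores 8 \
  --executor-memory 54g \
  --conf spark.executor.instances=${NUM_EXECUTORS} \
  our-job.jar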
Any leads on how to debug this would be appreciated. Let us know if we should add any more options to async-profiler; we clearly see more time spent on CPU, hence -e cpu.