I am working on optimizing a data pipeline that leverages Apache Spark, HDFS, and YARN as the cluster manager. The Spark cluster consists of a limited number of internal machines that are shared across several groups, so building a given component of the pipeline can take different amounts of time depending on how heavily those machines are being used at the moment. I am trying to come up with a metric to judge how much my optimizations improve the performance of the existing data pipeline on a component-by-component basis. Right now, the two candidates I could think of are (with a rough sketch of how I would compute them right after the list):
1) Memory usage during the build * time taken to build the component
2) Number of CPUs used during the build * time taken to build the component
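To make that concrete, here is a minimal sketch of how I am thinking of computing the two metrics for a single component build. The inputs (peak memory, cores, wall-clock build time) are placeholder numbers I would pull by hand from the Spark UI or event logs, not calls to any real API:

```python
# Sketch: compute the two candidate cost metrics for one component build.
# peak_memory_gb, cores_used, and build_seconds are placeholders I would
# fill in from the Spark UI / event logs for that build.

def build_metrics(peak_memory_gb: float, cores_used: int, build_seconds: float) -> dict:
    """Return the two candidate cost metrics for a single component build."""
    return {
        # Metric 1: memory footprint held over the build duration (GB * seconds)
        "memory_seconds": peak_memory_gb * build_seconds,
        # Metric 2: CPU cores reserved over the build duration (core * seconds)
        "core_seconds": cores_used * build_seconds,
    }

# Example: a build that peaked at 64 GB on 16 cores and took 1800 s
print(build_metrics(peak_memory_gb=64, cores_used=16, build_seconds=1800))
# {'memory_seconds': 115200, 'core_seconds': 28800}
```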
What are your thoughts on these metrics? Is there a more accurate measurement, or a better way to quantify performance altogether? I would be open to any suggestions, as I am new to the world of Big Data. Any help would be much appreciated!
Thanks,
Taylor