I am working on optimizing a data pipeline that leverages Apache Spark, HDFS, and YARN as the cluster manager. The Spark cluster consists of a limited number of internal machines that are shared across several groups, so building a given component of the pipeline can take different amounts of time depending on how heavily those machines are being used at the moment. I am trying to come up with a metric to judge how much my optimizations improve the performance of the existing data pipeline on a component-by-component basis. Right now, the two candidates I could think of are (with a rough sketch of how I would compute them right after the list):
1) Memory usage during the build * time taken to build the component
2) Number of CPUs used during the build * time taken to build the component
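To make that concrete, here is a minimal sketch of how I am thinking of computing the two metrics for a single component build. The inputs (peak memory, cores, wall-clock build time) are placeholder numbers I would pull by hand from the Spark UI or event logs, not calls to any real API:

```python
# Sketch: compute the two candidate cost metrics for one component build.
# peak_memory_gb, cores_used, and build_seconds are placeholders I would
# fill in from the Spark UI / event logs for that build.

def build_metrics(peak_memory_gb: float, cores_used: int, build_seconds: float) -> dict:
    """Return the two candidate cost metrics for a single component build."""
    return {
        # Metric 1: memory footprint held over the build duration (GB * seconds)
        "memory_seconds": peak_memory_gb * build_seconds,
        # Metric 2: CPU cores reserved over the build duration (core * seconds)
        "core_seconds": cores_used * build_seconds,
    }

# Example: a build that peaked at 64 GB on 16 cores and took 1800 s
print(build_metrics(peak_memory_gb=64, cores_used=16, build_seconds=1800))
# {'memory_seconds': 115200, 'core_seconds': 28800}
```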
What are your thoughts on these metrics? Is there a more accurate measurement, or a better way to quantify performance altogether? I would be open to any suggestions, as I am new to the world of Big Data. Any help would be much appreciated!
Thanks,
Taylor