
Is there a way to monitor the input and output throughput of a Spark cluster, to make sure the cluster is not flooded or overwhelmed by incoming data?

In my case, I set up the Spark cluster on AWS EC2, so I'm thinking of using AWS CloudWatch to monitor the NetworkIn and NetworkOut for each node in the cluster.

But this approach doesn't seem accurate: network traffic is not only Spark's incoming data; other traffic would be counted as well.

Is there a tool or a way to monitor the streaming data status specifically for a Spark cluster? Or is there a built-in tool in Spark that I missed?


Update: Spark 1.4 has been released; the monitoring UI at port 4040 is significantly enhanced with graphical displays.

keypoint

2 Answers


Spark has a configurable metrics subsystem. By default it publishes a JSON version of the registered metrics at <driver>:<port>/metrics/json. Other metric sinks, such as Ganglia, CSV files or JMX, can be configured.

You will need some external monitoring system that collects metrics on a regular basis and helps you make sense of them. (N.b. we use Ganglia, but there are other open-source and commercial options.)
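
To illustrate how a sink is wired up (this is only a sketch, not part of the original answer; the CSV directory, Ganglia host/port and periods are placeholder assumptions, and the Ganglia sink also needs the spark-ganglia-lgpl artifact on the classpath), a minimal conf/metrics.properties could look like:

# Sketch of conf/metrics.properties -- values below are placeholders
# Write all metrics to CSV files every 10 seconds
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

# Or report to a Ganglia gmond (requires the spark-ganglia-lgpl artifact)
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=127.0.0.1
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds

Spark picks this file up from conf/ by default, or you can point to it explicitly with spark.metrics.conf (see the comments below).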

Spark Streaming publishes several metrics that can be used to monitor the performance of your job. To calculate the throughput of the last batch, in records per millisecond, you could combine:

lastReceivedBatch_records / (lastReceivedBatch_processingEndTime - lastReceivedBatch_processingStartTime)

For all supported metrics, have a look at StreamingSource.

Example: after starting a local REPL with Spark 1.3.1 and executing a trivial streaming application:

import org.apache.spark.streaming._

// 10-second batch interval on top of the REPL's SparkContext (sc)
val ssc = new StreamingContext(sc, Seconds(10))

// Build a queue of single-element RDDs to feed the stream
val queue = scala.collection.mutable.Queue(1, 2, 3, 45, 6, 6, 7, 18, 9, 10, 11)
val q = queue.map(elem => sc.parallelize(Seq(elem)))

val dstream = ssc.queueStream(q)
dstream.print()
ssc.start()

one can GET http://localhost:4040/metrics/json, which returns:

{
  "version": "3.0.0",
  "gauges": {
    "local-1430558777965.<driver>.BlockManager.disk.diskSpaceUsed_MB": { "value": 0 },
    "local-1430558777965.<driver>.BlockManager.memory.maxMem_MB": { "value": 2120 },
    "local-1430558777965.<driver>.BlockManager.memory.memUsed_MB": { "value": 0 },
    "local-1430558777965.<driver>.BlockManager.memory.remainingMem_MB": { "value": 2120 },
    "local-1430558777965.<driver>.DAGScheduler.job.activeJobs": { "value": 0 },
    "local-1430558777965.<driver>.DAGScheduler.job.allJobs": { "value": 6 },
    "local-1430558777965.<driver>.DAGScheduler.stage.failedStages": { "value": 0 },
    "local-1430558777965.<driver>.DAGScheduler.stage.runningStages": { "value": 0 },
    "local-1430558777965.<driver>.DAGScheduler.stage.waitingStages": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingDelay": { "value": 44 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingEndTime": { "value": 1430559950044 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingStartTime": { "value": 1430559950000 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_submissionTime": { "value": 1430559950000 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_totalDelay": { "value": 44 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime": { "value": 1430559950044 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_processingStartTime": { "value": 1430559950000 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_records": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_submissionTime": { "value": 1430559950000 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.receivers": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.retainedCompletedBatches": { "value": 2 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.runningBatches": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalCompletedBatches": { "value": 2 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalProcessedRecords": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalReceivedRecords": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.unprocessedBatches": { "value": 0 },
    "local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.waitingBatches": { "value": 0 }
  },
  "counters": { },
  "histograms": { },
  "meters": { },
  "timers": { }
}
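
To turn the formula above into numbers automatically, you can poll this endpoint and combine the lastReceivedBatch_* gauges. The following is only a sketch, not part of the original answer: it assumes the driver UI is on localhost:4040 and that the application is the Spark shell (so the gauge names look like the ones above), and it uses a crude regular expression instead of a JSON library to stay dependency-free:

import scala.io.Source

// Fetch the Dropwizard metrics registry from the driver's web UI (port 4040 assumed)
val json = Source.fromURL("http://localhost:4040/metrics/json").mkString

// Crude extraction of a single gauge value by metric-name suffix
// (assumes the gauge's "value" field directly follows its name, as in the JSON above)
def gauge(suffix: String): Long = {
  val pattern = ("\"[^\"]*" + suffix + "\"\\s*:\\s*\\{\\s*\"value\"\\s*:\\s*(\\d+)").r
  pattern.findFirstMatchIn(json) match {
    case Some(m) => m.group(1).toLong
    case None    => sys.error(s"gauge $suffix not found")
  }
}

val records = gauge("lastReceivedBatch_records")
val startMs = gauge("lastReceivedBatch_processingStartTime")
val endMs   = gauge("lastReceivedBatch_processingEndTime")

// Records per second for the last received batch (guard against a zero-length batch)
val durationMs = math.max(endMs - startMs, 1L)
println(s"last batch throughput: ${records * 1000.0 / durationMs} records/s")

In practice you would let your monitoring system (Ganglia, Prometheus, etc.) do this kind of aggregation for you rather than scraping the JSON by hand.
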
maasg
  • Hi maasg, how did you set up the file "conf/metrics.properties"? I posted a separate answer describing what I found. Thanks a lot – keypoint May 01 '15 at 21:02
  • @keypoint You can configure the location of the property file by providing `spark.metrics.conf = .properties`. If it's not found on the filesystem at the specified path, Spark will try to load it from the classpath. – maasg May 02 '15 at 09:22
  • @keypoint I updated the answer with a very small example you could try yourself. Hope that helps. – maasg May 02 '15 at 09:55
  • That's so cool! Now I get the same output you have here, thank you so much – keypoint May 02 '15 at 22:14
  • I am using Ganglia to monitor Spark, but the Ganglia metrics XML reporter doesn't escape correctly, which breaks gmond. This is my configuration: `*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink`, `*.sink.ganglia.host=127.0.0.1`, `*.sink.ganglia.port=8649`, `*.sink.ganglia.period=10`, `*.sink.ganglia.unit=seconds`, `*.sink.ganglia.ttl=1`, `*.sink.ganglia.mode=unicast` – Junaid Jul 29 '15 at 12:01

I recommend using the Spark metrics system (https://spark.apache.org/docs/latest/monitoring.html#metrics) with Prometheus (https://prometheus.io/).

Metrics generated by Spark's metrics system can be captured with Prometheus, which offers a UI as well. Prometheus is a free tool.
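
One possible way to wire this up, sketched under the assumption that you run Spark 3.0 or later (which ships a PrometheusServlet sink): expose the metrics in Prometheus format on the driver's UI port via conf/metrics.properties, then add that endpoint as a scrape target in your Prometheus configuration.

# Sketch for Spark 3.0+; the PrometheusServlet sink serves metrics on the web UI port
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus

The driver then serves Prometheus-format metrics at <driver>:4040/metrics/prometheus for Prometheus to scrape and graph; on older Spark versions, a common alternative is the JMX sink combined with the Prometheus JMX exporter.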

Shiva Garg