
I am trying to work with a large RDD as read by a file DStream.

The code looks as follows:

val creatingFunc = { () =>
  val conf = new SparkConf()
              .setMaster("local[10]")
              .setAppName("FileStreaming")
              .set("spark.streaming.fileStream.minRememberDuration", "2000000h")
              .registerKryoClasses(Array(classOf[org.apache.hadoop.io.LongWritable],
classOf[org.apache.hadoop.io.Text], classOf[GGSN]))

  val sc = new SparkContext(conf)

  // Create a StreamingContext
  val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))

  val appFile = httpFileLines
                  .map(x=> (x._1,x._2.toString()))
                  .filter(!_._2.contains("ggsnIPAddress"))
                  .map(x=>(x._1,x._2.split(",")))

  var count=0

  appFile.foreachRDD(s => {
    // s.collect() throws an exception due to an insufficient amount of memory
    // s.count() throws an exception due to an insufficient amount of memory
    s.foreach(x => count = count + 1)
  })

  println(count)
  newContextCreated = true
  ssc
}

What I am trying to do is to get the count of my RDD. However, since it is large, it throws an exception, so I need to do a foreach instead to avoid collecting the data into memory.

I then want to get the count the way shown in my code, but it always gives 0.

Is there a way to do this?

Mahdi
  • When dealing with RDDs, you cannot accumulate a sum into a local variable like this. You need to use an `org.apache.spark.Accumulator`, or you can just call `RDD.count` or `DStream.count` (a sketch follows below these comments). – sarveshseri Aug 12 '16 at 08:07
  • Where is your `httpFileLines` being created? Is it an `RDD` or a `DStream`? – sarveshseri Aug 12 '16 at 08:09
  • Do you want the count of your RDDs or the count of all elements in the DStream? – Knight71 Aug 12 '16 at 09:40
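
Following up on the accumulator suggestion, here is a minimal sketch (assuming Spark 2.x, where SparkContext exposes longAccumulator; on 1.x you would use sc.accumulator instead). The point is that the var in the question is captured by value in the executor closures, so the driver's copy never changes, while an accumulator is merged back to the driver after each action:

val elementCount = sc.longAccumulator("elementCount")

appFile.foreachRDD { rdd =>
  // runs on the executors; updates are merged back to the driver
  rdd.foreach(_ => elementCount.add(1))
  // foreachRDD itself runs on the driver, so reading the value here is safe
  println(s"elements seen so far: ${elementCount.value}")
}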

2 Answers


There's no need to use foreachRDD and call count yourself. You can use the count method defined on DStream:

val appFile = httpFileLines
                .map(x => (x._1, x._2.toString()))
                .filter(!_._2.contains("ggsnIPAddress"))
                .map(x => (x._1, x._2.split(",")))

val count = appFile.count()

If that still yields an insufficient-memory exception, you either need to process smaller batches of data each time, or enlarge your worker nodes to handle the load.
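
Note that count() on a DStream yields a new DStream with a single Long per batch rather than a plain number; a minimal sketch, if you just want to see each batch's count on the driver:

appFile.count().print()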

Yuval Itzchakov
  • That doesn't return the count of all elements in the DStream; I still need to do a foreach. – Mahdi Aug 21 '16 at 23:46

Regarding your solution: you should avoid the collect and instead sum the count of each RDD of the DStream.

var count = 0L
appFile.foreachRDD { rdd =>
  count = count + rdd.count()
}

Note that, unlike the var in your question, this one works: foreachRDD and rdd.count() both run on the driver, so the variable is updated there. Still, I find this solution rather ugly (the use of a var in Scala).

I prefer the following solution:

val counts: DStream[Long] = appFile.count().reduce(_ + _)

Notice that the count method returns a DStream[Long] and not a plain Long, so even after the reduce the result is still a DStream; to read the actual value for each batch you still need to materialize it (for example with print() or foreachRDD).
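
A minimal sketch (under the assumption that the running total only ever lives on the driver) of turning that per-batch DStream[Long] into a cumulative count:

var total = 0L
appFile.count().foreachRDD { rdd =>
  // count() produces a single-element RDD per batch; pull that one Long back to the driver
  val batchCount = rdd.collect().headOption.getOrElse(0L)
  total += batchCount
  println(s"this batch: $batchCount, running total: $total")
}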

Mehdi