7

In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, something faster than Dataset.count()?

Could we perhaps derive this information from the number of partitions of the Dataset?

lovasoa
  • 6,419
  • 1
  • 35
  • 45

2 Answers

11

You could try using countApprox on the RDD API. Although this also launches a Spark job, it should be faster, since it only gives you an estimate of the true count for a given amount of time you are willing to spend (in milliseconds) and a confidence (i.e. the probability that the true value is within the returned interval):

Example usage:

val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)

You have to play around with the parameters timeout and confidence a bit. The higher the timeout, the more accurate the estimated count will be.
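If you only need a single number rather than an interval, a comment below suggests passing confidence = 0 so that low and high coincide. A minimal sketch of that variant (df is assumed to be your Dataset, and the collapsing of the bounds follows that comment rather than any guarantee in the API docs):

// Assumption: df is an existing Dataset/DataFrame.
// With confidence = 0.0 the returned bounds collapse to a single point estimate,
// so low, high and mean all report the same value.
val pointEstimate = df.rdd
  .countApprox(timeout = 1000L, confidence = 0.0)
  .initialValue
  .mean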

Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
  • Thank you very much, this is exactly what I was looking for. – lovasoa May 31 '17 at 11:16
  • A small clarification: if, like me, you need a single number and not an interval, then you'd better set the confidence to 0 (and thus get a single value for low and high). If you use a high value (like the default of 0.95) and then take the mean of `low` and `high`, the result will be less precise. – lovasoa Jun 14 '17 at 21:57
  • 2
    I have tried this on large datasets and it does not appear to save much (if any..) time. – WestCoastProjects Jul 09 '18 at 13:09
  • I don't think this will save much time since I've read that calling `df.rdd` is expensive – alex Jan 12 '21 at 15:27
2

If you have a truly enormous number of records, you can get an approximate count using something like HyperLogLog, and this might be faster than count(). However, you won't be able to get any result without kicking off a job.
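For illustration, Spark's built-in HLL-based aggregate is approx_count_distinct. As a comment below points out, it estimates the number of distinct values, so it only approximates the total row count when applied to a column that is unique per row; the column name and rsd value here are assumptions, and this is a sketch rather than a drop-in replacement for count():

import org.apache.spark.sql.functions.approx_count_distinct

// Assumption: "id" is unique per row, so the number of distinct ids ≈ number of rows.
// rsd is the maximum relative standard deviation allowed for the HLL estimate.
val approxRows = df
  .agg(approx_count_distinct("id", rsd = 0.05))
  .first()
  .getLong(0)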

When using Spark there are two kinds of RDD operations: transformations and actions. Roughly speaking, a transformation takes an RDD and describes a new RDD derived from it, while an action computes or materializes some result. Transformations are lazily evaluated, so they don't kick off a job until an action is called at the end of a sequence of transformations.
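A minimal sketch of that distinction (the parallelized range is just a stand-in for real data):

val rdd = spark.sparkContext.parallelize(1 to 1000000)

// Transformations: only recorded, nothing runs yet.
val evens = rdd.filter(_ % 2 == 0).map(_ * 2)

// Action: this is the call that actually triggers a Spark job.
val n = evens.count()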

Because Spark is a distributed batch-processing framework, there is a lot of overhead in running jobs. If you need something that feels more like "real time", whatever that means, either use plain Scala (or Python) if your data is small enough, or move to a streaming approach and do something like updating a counter as new records flow through.
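A minimal Structured Streaming sketch of that last idea, assuming records arrive one per line on a socket at localhost:9999 (the socket source and console sink are placeholders, not part of the original answer):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("running-count").getOrCreate()

// Each incoming line is treated as one record.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// A global aggregation keeps a running count of all records seen so far.
val runningCount = lines.groupBy().count()

runningCount.writeStream
  .outputMode("complete")   // re-emit the full (single-row) count on every trigger
  .format("console")
  .start()
  .awaitTermination()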

Metropolis
  • 2,018
  • 1
  • 19
  • 36
  • 1
    HyperLogLog would count the number of distinct items in the dataset, which does not address the OP's question – vpipkt Jul 14 '21 at 15:09