
I have a Spark application running on top of YARN. Given an RDD, I need to execute a query against a database. The problem is that I have to set proper connection options, otherwise the database will be overloaded, and these options depend on the number of workers querying the DB simultaneously. To solve this I want to detect the current number of running workers at runtime (from a worker). Something like this:

val totalDesiredQPS = 1000 // queries per second
val queries: RDD[String] = ???
queries.mapPartitions(it => {
      val dbClientForThisWorker = ...
      //TODO: get this information from YARN somehow
      val numberOfContainers = ???
      dbClientForThisWorker.setQPS(totalDesiredQPS / numberOfContainers)
      it.map(query => dbClientForThisWorker.executeAsync...)
      ....
})
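Whatever the executor count turns out to be, the straight division in the snippet needs guarding: integer division can round down to 0 QPS, and the count may momentarily be unknown or zero. A minimal sketch of a safer split (the helper name is mine, not part of any API):

```scala
// Hypothetical helper: divide the total QPS budget across workers,
// guarding the edge cases the plain division above would hit.
def perWorkerQps(totalQps: Int, numWorkers: Int): Int =
  if (numWorkers <= 0) totalQps           // count not known yet: don't throttle to 0
  else math.max(1, totalQps / numWorkers) // integer division can round down to 0
```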

I'd also appreciate alternative solutions, but I want to avoid a shuffle and get close to full DB utilization regardless of the number of workers.

simpadjo
  • I'd say it's a job for [Hadoop Fair Scheduler](https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html) - it can be configured correspondingly – MaxU - stand with Ukraine Oct 19 '17 at 08:57
  • @MaxU Anyway, I can't estimate the number of workers before runtime. I could make it fixed, but in that case I can't take advantage of cluster elasticity. – simpadjo Oct 19 '17 at 09:01
  • I guess you can think like this if you are the only user of that Hadoop cluster. But imagine you have 1000 people with access to the cluster, and some of them want their jobs done as fast as possible and don't care about the others. The Fair Scheduler is designed to manage such things... – MaxU - stand with Ukraine Oct 19 '17 at 09:24
  • 1
    This looks like it may be a duplicate of [How many Executors and Cores are allocated to my spark job](https://stackoverflow.com/a/39163142/3693889) – Andrew Mo Oct 19 '17 at 09:39
  • @AndrewMo thank you, very close to what I'm looking for. But is it possible to run this code on a worker somehow? – simpadjo Oct 19 '17 at 09:48
  • Have you considered putting that information into a broadcast variable? – Andrew Mo Oct 19 '17 at 09:50
  • @AndrewMo the problem is that the number of available workers can change significantly over time. With a broadcast I can only pass the information available at the time of submitting the RDD, and I want to get this information right when the 'mapPartitions' phase starts – simpadjo Oct 19 '17 at 09:56
  • Have you considered having the Driver poll/determine the number of available workers and store this quantity in an external shared cache (e.g. Redis); you could define a function to be used by any active workers to retrieve the current value from that cache at the time of execution. – Andrew Mo Oct 19 '17 at 10:39
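The shared-cache idea from the last comment could look roughly like this. Everything here is a sketch under assumptions: `Store` is a stand-in for the external cache (e.g. Redis), and on a real driver the published count might come from polling `sc.getExecutorMemoryStatus.size - 1` on a timer; the function names are mine.

```scala
// Stand-in for an external shared cache (e.g. Redis) reachable from
// both the driver and the workers.
trait Store {
  def put(key: String, value: Int): Unit
  def get(key: String): Option[Int]
}

// Driver side: publish the current executor count. On a real driver this
// could be fed by sc.getExecutorMemoryStatus.size - 1, polled periodically.
def publishExecutorCount(store: Store, count: Int): Unit =
  store.put("executorCount", count)

// Worker side (inside mapPartitions): read the latest count and derive
// this worker's QPS share, falling back to the full budget if nothing
// has been published yet.
def workerQps(store: Store, totalQps: Int): Int =
  store.get("executorCount") match {
    case Some(n) if n > 0 => math.max(1, totalQps / n)
    case _                => totalQps
  }
```

The worker-side read happens at the start of each partition, so a count refreshed by the driver is picked up by whatever partitions start afterwards, which is exactly the elasticity the broadcast approach can't give.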

0 Answers