
Initially I had a lot of data, but using Spark SQL, and especially groupBy, it could be trimmed down to a manageable size (small enough to fit in the RAM of a single node).

How can I perform functions (in parallel) on all the groups (distributed among my nodes)?

How can I make sure that the data for a single group is collected onto a single node? For example, I will probably want to use a local matrix for the computation, but I do not want to run into errors regarding data locality.

Georg Heiler

2 Answers


Let's say you have x executors (in your case, probably one executor per node), and you want to partition the data on your keys in such a way that each key falls into a unique bucket, which would be something like a perfect partitioner. There is no generic way of doing that, but it may be achievable if there is some inherent distribution or logic specific to your data.
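For instance, if the distinct keys are known up front (an assumption, not something Spark gives you for free), such a "perfect" partitioner can be sketched by mapping each key to its own bucket; the class name here is hypothetical:

  import org.apache.spark.Partitioner

  // Hypothetical "perfect" partitioner: every distinct key gets its own partition,
  // so no two groups ever share a bucket. Requires knowing the keys in advance.
  class ExactKeyPartitioner(keys: Seq[Any]) extends Partitioner {
    private val indexOfKey = keys.zipWithIndex.toMap
    override def numPartitions: Int = keys.size
    override def getPartition(key: Any): Int = indexOfKey.getOrElse(key, 0)
  }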

I dealt with a specific case where I found that Spark's built-in hash partitioner was not doing a good job of distributing the keys uniformly, so I wrote a custom partitioner using Guava like this:
  import com.google.common.hash.Hashing

  class FooPartitioner(partitions: Int) extends org.apache.spark.HashPartitioner(partitions) {
    override def getPartition(key: Any): Int = {
      // Hash Int keys with murmur3 and fall back to hashCode for anything else,
      // then map the hash onto the available buckets with Guava's consistent hashing.
      val hasher = Hashing.murmur3_32().newHasher()
      Hashing.consistentHash(
        key match {
          case i: Int => hasher.putInt(i).hash.asInt()
          case _      => key.hashCode
        },
        partitions)
    }
  }
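Applied to a pair RDD it would look something like this (keyedRdd and numPartitions are assumptions for illustration):

  // partitionBy shuffles the data once so that every bucket ends up on a single executor.
  val partitioned = keyedRdd.partitionBy(new FooPartitioner(numPartitions))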

Then I added this partitioner instance as an argument to the combineByKey that I was using, so that the resulting RDD is partitioned in this fashion. This does a good job of distributing the data across x buckets, but I guess there are no guarantees that each bucket will contain only one key.
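For example (the value type and merge functions below are placeholders, not from the original answer; keyedRdd is an assumed RDD[(Int, Double)]):

  val combined = keyedRdd.combineByKey(
    (v: Double) => List(v),                        // createCombiner
    (acc: List[Double], v: Double) => v :: acc,    // mergeValue
    (a: List[Double], b: List[Double]) => a ++ b,  // mergeCombiners
    new FooPartitioner(numPartitions))             // custom partitioner decides the bucket per key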

In case you are on Spark 1.6 and using DataFrames, you can define a UDF like this

  val hasher = udf((i: Int) => Hashing.consistentHash(Hashing.murmur3_32().newHasher().putInt(i).hash.asInt(), PARTITION_SIZE))

and then do

  dataframe.repartition(hasher(keyThatYouAreUsing))

Hopefully this provides some hint to get started.
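A slightly more self-contained sketch of that DataFrame variant, assuming dataframe is an existing DataFrame with an integer column someKey (the column name and bucket count are assumptions standing in for keyThatYouAreUsing and PARTITION_SIZE):

  import com.google.common.hash.Hashing
  import org.apache.spark.sql.functions.{col, udf}

  val numBuckets = 16  // hypothetical bucket count

  val hasher = udf { i: Int =>
    Hashing.consistentHash(Hashing.murmur3_32().newHasher().putInt(i).hash.asInt(), numBuckets)
  }

  // All rows whose key hashes to the same bucket end up in the same partition.
  val repartitioned = dataframe.repartition(hasher(col("someKey")))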

sourabh
  • But do I understand correctly that a dataFrame.groupBy("someKey") will automatically change the partitioning? If I need a special partitioning, a custom partitioner needs to be applied. And to calculate a function in parallel, a UDF should be used? – Georg Heiler Apr 21 '16 at 06:54
  • Yeah; a Spark DataFrame would most likely use the hash partitioner for repartitioning after groupBy. In the RDD API you will notice that groupBy can take a partitioner as an argument. Ideally, in a DataFrame, if I do a repartition followed by a groupBy in such a way that the repartition ensures that all the keys required for a groupBy are present in the same partition, then it should not do a shuffle; but I am not sure if this can be banked upon. Inherently in Spark a function will be applied in parallel, so yes, a UDF is computed in parallel, and so is any other transformation that you do in .select or .map etc. – sourabh Apr 21 '16 at 07:43
  • Thanks. I will need to look into it further, but this sounds like a good starting point. – Georg Heiler Apr 21 '16 at 07:44
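A minimal sketch of the RDD groupBy-with-a-partitioner variant mentioned in the comments above (keyedRdd and numPartitions are assumptions):

  // groupBy accepts an explicit partitioner, so each group lands in the bucket that it chooses.
  val grouped = keyedRdd.groupBy((pair: (Int, Double)) => pair._1, new FooPartitioner(numPartitions))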

I found a solution in the blog post Efficient UD(A)Fs with PySpark:

  1. mapPartitions splits the data into partitions;
  2. the UDF converts each Spark partition to a pandas DataFrame;
  3. do your data ETL logic in the UDF and return a pandas DataFrame;
  4. the UDF's pandas result is converted back to Spark rows;
  5. toDF() merges the result into a Spark DataFrame, which you can then persist, e.g. with saveAsTable;
df = df.repartition('guestid').rdd.mapPartitions(udf_calc).toDF()
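For reference, here is the same repartition → mapPartitions → DataFrame pipeline sketched in Scala, to keep one language for the examples in this thread (Spark 2.x API; the column names, table name, and the pass-through udfCalc are assumptions standing in for the blog post's pandas-based logic):

  import org.apache.spark.sql.{Row, SparkSession}

  val spark = SparkSession.builder().getOrCreate()

  // Placeholder for the per-partition ETL; in the PySpark version this is where each
  // partition is turned into a pandas DataFrame, transformed, and emitted again.
  def udfCalc(rows: Iterator[Row]): Iterator[Row] = rows

  val df = spark.range(100).selectExpr("id % 10 as guestid", "id as value")

  // Repartition so all rows of a guestid land in one partition, run the
  // per-partition function, then rebuild a DataFrame with the original schema.
  val result = spark.createDataFrame(
    df.repartition(df("guestid")).rdd.mapPartitions(udfCalc),
    df.schema)

  // Persist the result, mirroring step 5 above.
  result.write.mode("overwrite").saveAsTable("guest_results")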
geosmart
  • Welcome to Stack Overflow! Your answer should contain the solution, [not just a link](https://meta.stackexchange.com/a/8259). If it's too extensive, at least write an outline _how_ the linked content solves the problem. Also [answers that are little more than a link may be deleted](https://stackoverflow.com/help/deleted-answers). – ascripter Apr 03 '18 at 01:11