
The following is my Scala Spark code:

val vertex = graph.vertices
val edges = graph.edges.map(v=>(v.srcId, v.dstId)).toDF("key","value")
var FMvertex = vertex.map(v => (v._1, HLLCounter.encode(v._1)))
var encodedVertex = FMvertex.toDF("keyR", "valueR")

var Degvertex = vertex.map(v => (v._1, 0.toLong))
var lastRes = Degvertex
//calculate FM of the next step
breakable {
  for (i <- 1 to MaxIter) {
    var N_pre = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
    var adjacency = edges.join(
      encodedVertex,//FMvertex.toDF("keyR", "valueR"),
      $"value" === $"keyR"
    ).rdd.map(r => (r.getAs[VertexId]("key"), r.getAs[Array[Byte]]("valueR"))).reduceByKey((a,b)=>HLLCounter.Union(a,b))
    FMvertex = FMvertex.union(adjacency).reduceByKey((a,b)=>HLLCounter.Union(a,b))

    // update vertex encoding
    encodedVertex = FMvertex.toDF("keyR", "valueR")

    var N_curr = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
    lastRes = N_curr
    var middleAns = N_curr.union(N_pre).reduceByKey((a,b)=>Math.abs(a-b))//.mapValues(x => x._1 - x._2)
    if (middleAns.values.sum() == 0){
      println(i)
      break
    }
    Degvertex = Degvertex.join(middleAns).mapValues(x => x._1 + i * x._2)//.map(identity)
  }
}
val res = Degvertex.join(lastRes).mapValues(x => x._1.toDouble / x._2.toDouble)
return res

It uses several functions I defined in Java:

import net.agkn.hll.HLL;
import com.google.common.hash.*;
import com.google.common.hash.Hashing;

import java.io.Serializable;

public class HLLCounter implements Serializable {
    private static int seed = 1234567;
    private static HashFunction hs = Hashing.murmur3_128(seed);

    private static int log2m = 15;
    private static int regwidth = 5;


    public static byte[] encode(Long id) {
        HLL hll = new HLL(log2m, regwidth);
        Hasher myhash = hs.newHasher();
        hll.addRaw(myhash.putLong(id).hash().asLong());
        return hll.toBytes();
    }

    public static byte[] Union(byte[] byteA, byte[] byteB) {
        HLL hllA = HLL.fromBytes(byteA);
        HLL hllB = HLL.fromBytes(byteB);
        hllA.union(hllB);
        return hllA.toBytes();
    }

    public static long decode(byte[] bytes) {
        HLL hll = HLL.fromBytes(bytes);
        return hll.cardinality();
    }
}

This code calculates Effective Closeness on a large graph, using the HyperLogLog package.
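
For reference, this is roughly how those helpers are meant to compose (a minimal sketch outside Spark, just for illustration):

val sketchA = HLLCounter.encode(1L)               // HLL sketch containing vertex id 1
val sketchB = HLLCounter.encode(2L)               // HLL sketch containing vertex id 2
val merged  = HLLCounter.Union(sketchA, sketchB)  // merge the two sketches
println(HLLCounter.decode(merged))                // approximate distinct count, ~2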

The code works fine when I ran it on a graph with about ten million vertices and a hundred million edges. However, when I ran it on a graph with billions of vertices and billions of edges, after several hours of running on the cluster, it shows:

Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 29.1 failed 4 times, most recent failure: Lost task 91.3 in stage 29.1 (TID 17065, 9.10.135.216, executor 102): java.io.IOException: : No space left on device
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:326)
 at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
 at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)

Can anybody help me? I have only been using Spark for a few days. Thank you for helping.

Xiaotian Han
  • @KenWhite Let's assume 韩笑天 understands that. What can be done? I'm not familiar with apache-spark (I accessed this from a 1st post review queue) but in other domains, there are workarounds such as batching the task, doing the task on the cloud, doing the task in a more memory-efficient way. Does any of that apply here? – Zev Jun 04 '18 at 02:47
    Actually, my cluster has enough space to store the data I want to compute. However, the Spark log shows that my input was about 10GB, but the shuffle read and shuffle write is about 1TB. I do not need those intermediate variables or RDDs; how can I eliminate them? Sorry that I didn't state my question clearly. Thank you. – Xiaotian Han Jun 04 '18 at 02:54
  • Sounds good! You can go back and edit your question with that additional info. – Zev Jun 04 '18 at 02:56

1 Answer


Xiaotian, you state "The shuffle read and shuffle write is about 1TB. I do not need those intermediate values or RDDs". This statement affirms that you are not familiar with Apache Spark or possibly the algorithm you are running. Please let me explain.

When adding three numbers, you have to choose which two numbers to add first, for example (a+b)+c or a+(b+c). Once that choice is made, a temporary intermediate value is held for the sum inside the parentheses. It is not possible to continue the computation across all three numbers without that intermediate value.

An RDD is a space-efficient data structure. Each "new" RDD represents a set of operations across an entire data set. Some RDDs represent a single operation, like "add five", while others represent a chain of operations, like "add five, then multiply by six, and subtract seven". You cannot discard an RDD without discarding some portion of your mathematical algorithm.
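
As a rough sketch of that idea (illustrative only, assuming an existing SparkContext named sc, not code from your job):

val base = sc.parallelize(1 to 1000000)      // the source data set as one RDD
val addFive = base.map(_ + 5)                // an RDD representing "add five"
val chained = addFive.map(_ * 6).map(_ - 7)  // "add five, then multiply by six, then subtract seven"
// Each RDD here is only a recipe over its parent; discarding `chained` discards
// that part of the algorithm rather than freeing already-finished work.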

At its core, Apache Spark is a scatter-gather algorithm. It distributes a data set to a number of worker nodes, where that data set is part of a single RDD that gets distributed, along with the needed computations. At this point in time, the computations are not yet performed. As the data is requested from the computed form of the RDD, the computations are performed on-demand.
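
For example (again just a sketch with an assumed SparkContext sc):

val nums = sc.parallelize(1 to 100, 4)  // data scattered across 4 partitions on the workers
val doubled = nums.map(_ * 2)           // nothing is computed yet; only the lineage is recorded
val total = doubled.count()             // an action: now the computation actually runs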

Occasionally, it is not possible to finish a computation on a single worker without knowing some of the intermediate values from other workers. This kind of cross-communication always happens between the head node, which distributes the data to the various workers and then collects and aggregates their results, and those workers; but, depending on how the algorithm is structured, it can also occur mid-computation (especially in algorithms that groupBy or join slices of the data).
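
Your own loop does this through its join and reduceByKey calls; in general, any wide operation forces such an exchange (sketch only, with made-up data and an assumed SparkContext sc):

val left   = sc.parallelize(Seq((1L, "a"), (2L, "b")))
val right  = sc.parallelize(Seq((1L, "x"), (2L, "y")))
val joined = left.join(right)   // shuffle: rows with equal keys must meet on one executor
val summed = sc.parallelize(Seq((1L, 3), (1L, 4), (2L, 5))).reduceByKey(_ + _)  // shuffle as well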

You have an algorithm that requires shuffling, in such a manner that a single node cannot collect the results from all of the other nodes, because that single node doesn't have enough RAM (or local disk to spill the shuffle files to) to hold the intermediate values collected from the other nodes.

In short, you have an algorithm that can't scale to accommodate the size of your data set with the hardware you have available.

At this point, you need to go back to your Apache Spark algorithm and see if it is possible to

  1. Tune the partitioning of the RDDs to reduce the cross-talk (too many small partitions can require more cross-talk during shuffling, since a fully connected inter-partition transfer grows at O(N^2); partitions that are too big might run out of RAM within a compute node). A sketch follows this list.
  2. Restructure the algorithm so that full shuffling is not required (sometimes you can reduce in stages, so that you are dealing with more reduction phases, each phase combining less data).
  3. Restructure the algorithm so that shuffling is not required (it is possible, though unlikely, that the algorithm is simply mis-written and that factoring it differently can avoid requesting remote data from a node's perspective).
  4. If the problem is in collecting the results, rewrite the algorithm to return the results not to the head node's console, but to a distributed file system that can accommodate the data (like HDFS). A sketch of this also follows the list.
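
As a rough illustration of points 1 and 4, reusing the names from your snippet (edges, res); here spark is assumed to be your SparkSession, and the partition count and output path are placeholders rather than recommendations:

// Point 1: make the shuffle partitioning an explicit, deliberate choice.
spark.conf.set("spark.sql.shuffle.partitions", "2000")    // placeholder value
val repartitionedEdges = edges.repartition(2000, $"key")  // placeholder partition count

// Point 4: write the final result to a distributed file system instead of
// pulling it back to the head node.
res.saveAsTextFile("hdfs:///path/to/effective_closeness") // placeholder path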

Without the nuts-and-bolts of your Apache Spark program, access to your data set, and access to your Spark cluster and its logs, it's hard to know which one of these common approaches would benefit you the most; so I listed them all.

Good Luck!

Edwin Buck
  • Thank you very much. I am a beginner in Spark and learned a lot from this; I will try to improve my algorithm according to your advice. Thank you again! – Xiaotian Han Jun 07 '18 at 05:55