
I'm trying to use Scala's parallel collections to dispatch some computations in parallel. Because there's a lot of input data, I'm using mutable arrays to store data to avoid GC issues. This is the initial approach I took:

// initialize the reusable input data structure
val inputData = new Array[Array[Int]](Runtime.getRuntime.availableProcessors*ChunkSize)
for (i <- 0 until inputData.length) {
  inputData(i) = new Array[Int](arraySize)
}

// process the input
while (haveMoreInput()) {
  // read the input--must be sequential!
  for (array <- inputData) {
    for (index <- 0 until arraySize) {
      array(index) = deserializeFromExternalSource()
    }
  }
  // map the data in parallel
  // note that the input data is NOT modified by longRunningProcess
  val results = for (array <- inputData.par) yield {
    longRunningProcess(array)
  }
  // use the results--must be sequential and ordered as input
  for (result <- results.toArray) {
    useResult(result)
  }
}

Given that a ParallelArray's underlying array can be safely reused (viz., modified and used as the underlying structure of another ParallelArray), the above snippet should work as expected. However, when run, it crashes with a memory error:

*** Error in `*** Error in `java': double free or corruption (fasttop): <memory address> ***

This is ostensibly related to the fact that the parallel collection directly uses the array it was created from; perhaps it's attempting to free this array when it goes out of scope. In any case, creating a new array with each loop isn't an option, again, due to memory constraints. Explicitly creating a var parInputData = inputData.par both inside and outside of the while loop leads to the same double-free error.

I can't simply make inputData itself a parallel collection because it needs to be populated sequentially (having tried to make assignments to a parallel version, I realized that assignments were not performed in order). Using a Vector as the outer data structure seems to work for relatively small input sizes (< 1000000 input arrays) but leads to GC overhead exceptions on large inputs.

The approach I ended up taking involved making a Vector[Vector[Array[Int]]], with the outer vector having a length equal to the number of parallel threads being used. I then manually populated each sub-Vector with a chunk of input data arrays and then did a parallel map over the outer vector.
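For reference, this is roughly what that workaround looks like, as a minimal sketch: it assumes the preallocated inputData arrays from the first snippet are reused, and the grouping via grouped(ChunkSize) stands in for the manual chunking I actually did.

// sketch of the chunked workaround, reusing the preallocated inputData arrays
while (haveMoreInput()) {
  // fill the reusable arrays sequentially
  for (array <- inputData; index <- 0 until arraySize)
    array(index) = deserializeFromExternalSource()
  // wrap the same arrays in an immutable chunked structure, one chunk per thread
  val chunks: Vector[Vector[Array[Int]]] =
    inputData.grouped(ChunkSize).map(_.toVector).toVector
  // parallel map over the chunks; result ordering matches input ordering
  val results = chunks.par.map(_.map(longRunningProcess)).seq.flatten
  results.foreach(useResult)
}
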

This final approach works, but it is tedious to manually separate the input into chunks and add those chunks to a parallel collection another level deep. Is there a way to allow Scala to reuse a mutable array for parallel operations?

EDIT: Benchmarking the parallel vector solution above against a manually-parallelized solution using synchronous queues showed the parallel vector to be about 50% slower. I'm wondering whether this is simply the overhead of a better abstraction or whether the gap could be closed by using parallel arrays rather than Vectors; if so, that would be yet another benefit of using arrays over Vectors.

Ben Sidhom
    Is longRunningProcess somehow related to JNI/JNA? Because I'm pretty much sure that unless you're hitting some obscure JVM bug, it's not possible to get `double free or corruption` just because of GC. – om-nom-nom Jul 31 '14 at 22:52
    Nope, it's a pure Java method call. I'm not actually suggesting that this issue is caused by GC overhead but rather by the Scala parallel collections array implementation. – Ben Sidhom Jul 31 '14 at 22:57
  • I should also clarify that the `GC overhead` I'm referring to with the `Vector` approach is truly a `java.lang.OutOfMemoryError` and not a double free JVM issue. – Ben Sidhom Jul 31 '14 at 22:58

1 Answer


It doesn't really make sense to split your data into chunks; much of the point of the Parallel Collections library is that it does that for you, and it does a much better job than fixed chunk sizes. Also, arrays of arrays on the JVM are not like arrays of arrays in C: they are more like arrays of pointers to lots of little arrays, which makes them inefficient.

A more elegant way to solve this is to use an ordinary Array and a ParRange to operate on it. longRunningProcess would have to be changed to operate on a single element at a time:

val arraySize = ???

val inputData = new Array[Int](arraySize)
val outputData = new Array[ResultType](arraySize)

while (haveMoreInput()) {
  // read sequentially
  for (i <- 0 until arraySize)
    inputData(i) = deserializeFromExternalSource()
  // map in parallel; each index writes only to its own output slot
  for (i <- (0 until arraySize).par)
    outputData(i) = longRunningProcess(inputData(i))
  // consume sequentially, in input order
  outputData.foreach(useResult)
}

This uses only two large arrays, and never allocates any new arrays. ParArray.map, ParArray.toArray, and Array.par allocated new arrays in the original code.

We still have to use a fixed arraySize to make sure we don't load more data into memory than we have space for. A better solution would be to use reactive streams, but they aren't ready for production yet.

wingedsubmariner
  • I'm not actually "chunking" it into smaller arrays. The `longRunningProcess` is a Java method that I don't have control of and which requires an array as input. Consider each subarray a single input. – Ben Sidhom Aug 01 '14 at 20:13
  • I didn't notice your method of populating the output data though; it looks very useful for retaining ordering. A similar approach could be used even where the input data atoms are arrays. – Ben Sidhom Aug 01 '14 at 23:42
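To illustrate that last comment, here is a minimal sketch (not from the answer itself) of the same index-based pattern adapted to the case where each input element is a whole array, as in the question; numArrays and ResultType are placeholders.

// sketch: index-based parallel map where each input element is an array
val numArrays = Runtime.getRuntime.availableProcessors * ChunkSize

val inputData  = Array.fill(numArrays)(new Array[Int](arraySize))
val outputData = new Array[ResultType](numArrays)

while (haveMoreInput()) {
  // read sequentially into the reusable inner arrays
  for (array <- inputData; index <- 0 until arraySize)
    array(index) = deserializeFromExternalSource()
  // parallel map over indices; each task writes to its own output slot,
  // so results stay in input order and no new collections are allocated
  for (i <- (0 until numArrays).par)
    outputData(i) = longRunningProcess(inputData(i))
  // consume sequentially
  outputData.foreach(useResult)
}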