
I have this array of HashMaps, defined as below:

var distinctElementsDefinitionMap: scala.collection.mutable.ArrayBuffer[HashMap[String, Int]] =
  new scala.collection.mutable.ArrayBuffer[HashMap[String, Int]](300)
    with scala.collection.mutable.SynchronizedBuffer[HashMap[String, Int]]

Now, I have a parallel collection of 300 elements:

val max_length = 300
val columnArray = (0 until max_length).toParArray
import scala.collection.parallel.ForkJoinTaskSupport
columnArray.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(100))
columnArray foreach(i => {
    // Do Some Computation and get a HashMap
    var distinctElementsMap: HashMap[String, Int] = //Some Value
    //This line might result in Concurrent Access Exception
    distinctElementsDefinitionMap.update(i, distinctElementsMap)
})

I am now running a computation-intensive task within a foreach loop on the columnArray defined above. After the computation is complete, I would like each thread to update a particular entry of the distinctElementsDefinitionMap array; each thread writes only to the one index unique to it. I want to know if this update of an array entry is safe with multiple threads possibly writing to it at the same time. If not, is there a synchronized way of doing it so it's thread-safe? Thank you!

Update: It appears this is really not a safe way to do it; I am getting a java.util.ConcurrentModificationException. Any tips on how to avoid this while using the parallel collections?
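One way that avoids the exception, sketched here on the assumption of Scala 2.12 or earlier (where `.par` is in the standard library), is to replace the growable ArrayBuffer with a pre-sized plain Array: writes to distinct indices never touch the buffer's shared growable state, and joining the parallel tasks at the end of foreach makes the writes visible afterwards. `computeDistinctElements` is a hypothetical stand-in for the real per-column computation.

```scala
import scala.collection.mutable.HashMap

val maxLength = 300

// A fixed-size Array never changes its internal structure, so writes to
// distinct indices from different threads cannot trigger
// ConcurrentModificationException (unlike ArrayBuffer).
val distinctElementsDefinitionMap =
  new Array[HashMap[String, Int]](maxLength)

// Hypothetical stand-in for the real computation on column i
def computeDistinctElements(i: Int): HashMap[String, Int] =
  HashMap(s"value$i" -> i)

(0 until maxLength).par.foreach { i =>
  // Each task writes only its own slot i
  distinctElementsDefinitionMap(i) = computeDistinctElements(i)
}
```

Since `foreach` blocks until all tasks have finished, reading the array after the call sees every slot filled.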

MV23
  • You're abusing parallel collections -- they're not meant to be a stylish plain-old thread pool; instead you hand processing off to a smart pool (work stealing ftw!), avoid side effects, and then use the processing results (likely in a single-threaded fashion). Once again, these are **parallel** collections, not **concurrent** ones. Perhaps you can give us a bigger picture of what you're trying to achieve? – om-nom-nom Jul 11 '14 at 21:04
  • I totally agree; I know what I am doing is not the most optimal way, or even a good way. But I am a mere beginner in Scala and am still finding my way around. I need a parallel loop, and this is the only way that came to mind. Apologies for the rudimentary approach! – MV23 Jul 11 '14 at 21:10
  • No worries, but it is not quite clear why you need to update an entry in a map once you complete each task. If you clarify it, perhaps we can come up with an alternative idiomatic solution. – om-nom-nom Jul 11 '14 at 21:14
  • Well, basically I have a 300-column, 8-million-row data set. I need to create a hashmap for each of the columns: a mapping from String values to Integer values for each of the distinct values of that column. Hence the need for an array of HashMaps; each entry of the array is a hashmap corresponding to the distinct values of one column. One way to do it is sequentially: find the hashmap of each column and update the `distinctElementsDefinitionMap` array. But I would like to speed it up, thus the use of a parallel collection. – MV23 Jul 11 '14 at 21:21
  • Edited my question to show the update being done – MV23 Jul 11 '14 at 21:25

1 Answer


Use the .groupBy operation; as far as I can judge, it is parallelized (unlike some other methods, such as .sorted).

case class Row(a: String, b: String, c: String)
val data = Vector(
  Row("foo", "", ""), 
  Row("bar", "", ""), 
  Row("foo", "", "")
)

data.par.groupBy(x => x.a).seq
// Map(bar -> ParVector(Row(bar,,)), foo -> ParVector(Row(foo,,), Row(foo,,)))

Hope you got the idea.

Alternatively, if your RAM allows, parallelize processing over each column rather than each row; it has to be waaaay more efficient than your current approach (less contention).

val columnsCount = 3 // 300 in your case
Vector.range(0, columnsCount).par.map { column =>
  // Row is a case class, so fields are accessed by position
  // via productElement rather than by apply
  data.groupBy(row => row.productElement(column))
}.seq

Though you will likely have memory problems even with a single column (8M rows might be quite a lot).
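For the use case described in the comments (a String -> Int index per column), a sketch along these lines might work, assuming the data set is shaped as a Vector of Vector[String] rows (a hypothetical stand-in for the real 300-column data) and Scala 2.12 or earlier, where `.par` ships with the standard library:

```scala
// Hypothetical 3-column data set standing in for the real 300-column one
val data = Vector(
  Vector("foo", "x", "p"),
  Vector("bar", "y", "p"),
  Vector("foo", "x", "q")
)
val columnsCount = 3

// One String -> Int map per column: each distinct value in the column is
// assigned its index of first appearance. Each parallel task builds and
// returns its own map, so no shared mutable state is ever touched.
val distinctMaps: Vector[Map[String, Int]] =
  Vector.range(0, columnsCount).par.map { col =>
    data.map(row => row(col)).distinct.zipWithIndex.toMap
  }.seq
```

Because each task returns an immutable result instead of mutating a shared buffer, this sidesteps the ConcurrentModificationException entirely.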

om-nom-nom