
When calling Apache Mahout's SimilarityAnalysis for CCO, I get a fatal NegativeArraySizeException.

The code I'm running looks like this:

val result = SimilarityAnalysis.cooccurrencesIDSs(myIndexedDataSet:Array[IndexedDataset],
      randomSeed = 1234,
      maxInterestingItemsPerThing = 3,
      maxNumInteractions = 4)

I am seeing the following error and corresponding stack trace:

17/04/19 20:49:09 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 20)
java.lang.NegativeArraySizeException
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57)
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73)
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/04/19 20:49:09 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 21)
java.lang.NegativeArraySizeException
    ... (same stack trace as above)
17/04/19 20:49:09 WARN TaskSetManager: Lost task 0.0 in stage 11.0 (TID 20, localhost): java.lang.NegativeArraySizeException
    ... (same stack trace as above)

I'm using Apache Mahout version 0.13.0.


2 Answers


This always means that one of the input matrices is empty. How many matrices are in the array? What are the numbers of rows and columns in each? There is a companion object for IndexedDatasetSpark that supplies a constructor (called apply in Scala) taking an RDD[(String, String)], so if you can get your data into an RDD, just construct the IndexedDatasetSpark from that. Here the pairs of strings are (user-id, item-id) for some event like a purchase.

See the Companion Object here: https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala#L75

A little searching will find code to turn a CSV into an RDD[(String, String)] in a line or so. It will look something like this:

val rawPurchaseInteractions = sc.textFile("/path/in/hdfs").map { line =>
  (line.split(",")(0), line.split(",")(1))
}

Although this splits the line twice, it expects a text file of comma-delimited lines with user-id,item-id for some type of interaction like "purchase". If there are other fields in the file, just split out the user-id and item-id. The function passed to map returns a pair of Strings, so the resulting RDD has the right type, namely RDD[(String, String)]. Pass this into IndexedDatasetSpark with:

val purchasesRdd = IndexedDatasetSpark(rawPurchaseInteractions)(sc)

where sc is your Spark context. This should give you a non-empty IndexedDatasetSpark, which you can check by looking at the size of the wrapped BiDictionary instances or by calling methods on the wrapped Mahout DRM.
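
For example, a minimal sanity check might look like this (a sketch assuming purchasesRdd was built as above; rowIDs and columnIDs are the wrapped BiDictionary instances and matrix is the wrapped DRM):

// If any of these are zero, the IndexedDataset is empty, which is what
// produces the NegativeArraySizeException inside SimilarityAnalysis.
println(s"users (rows): ${purchasesRdd.rowIDs.size}")
println(s"items (columns): ${purchasesRdd.columnIDs.size}")
println(s"DRM dimensions: ${purchasesRdd.matrix.nrow} x ${purchasesRdd.matrix.ncol}")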

BTW, this assumes the CSV has no header; it is simple text-delimited input, not full-spec CSV. Using other methods in Spark you can read real CSV files, but there may be no need.
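
If your file does have a header row, one way to drop it before splitting (a sketch using only the standard RDD API; the path and field layout are placeholders) is:

val lines = sc.textFile("/path/in/hdfs")
val header = lines.first()                 // assumes the first line is the header
val rawPurchaseInteractions = lines
  .filter(_ != header)                     // drop the header row
  .map { line =>
    val fields = line.split(",")
    (fields(0), fields(1))                 // (user-id, item-id)
  }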

Thanks @pferrel for your response. I figured out what the issue was (see below); it had nothing to do with Mahout. – ldeluca Apr 20 '17 at 18:39

The problem actually had nothing to do with Mahout but with an earlier line:

inputRDD.filter(_ (1) == primaryFilter).map(o => (o(0), o(2)))

The indexing was off: I was using fields 1 to 3 instead of 0 to 2. Given the error, I was sure the problem was inside Mahout, but this earlier line turned out to be the real issue.
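
To illustrate (a hypothetical reconstruction, since the broken version isn't shown): with a zero-based split of each line, the indices must be 0 through 2, not 1 through 3:

// Wrong (hypothetical): one-based indices select the wrong fields and read past
// the end of the array, leaving the downstream IndexedDataset empty or invalid.
// inputRDD.filter(_(2) == primaryFilter).map(o => (o(1), o(3)))

// Right: zero-based indices 0..2
inputRDD.filter(_(1) == primaryFilter).map(o => (o(0), o(2)))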
