Error when giving a directory as input path to spark-itemsimilarity?

Question

I am getting the following error when running mahout spark-itemsimilarity from the terminal with a directory as the input path.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.mahout.math.cf.SimilarityAnalysis$.cooccurrencesIDSs(SimilarityAnalysis.scala:119)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:214)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:116)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:114)
    at scala.Option.map(Option.scala:145)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.main(ItemSimilarityDriver.scala:114)
    at org.apache.mahout.drivers.ItemSimilarityDriver.main(ItemSimilarityDriver.scala)

Thanks in advance.

What version of Mahout are you using? This has been caused by a bad input line such as a line with a null. Can you send me your input? — pferrel, May 28 '15 at 15:25

pferrel · Accepted Answer · 2015-05-28T20:02:14.530

Use Mahout 0.10.1-SNAPSHOT from the 0.10.x branch on GitHub, since it does not need the -D:spark... option.

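Using a directory as input requires a filename pattern to select the files inside it. The default pattern matches only HDFS "part-xxxxx" files, so CSV files in the directory are not picked up unless you pass your own pattern with -fp. Use the following command: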

$ mahout spark-itemsimilarity -i /home/kulwant/data/ -fp ".*csv" -o /home/kulwant/output/ --master spark://kulwant-VirtualBox:7077 -id "," --itemIDColumn 0 --rowIDColumn 1

The row ID is the user ID, so given your data I think you had the item and row columns reversed. The item ID seems to be in column 0 and the row/user ID in column 1 (I've fixed this in the command above).
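To see why the -fp pattern matters, here is a minimal Python sketch of regex-based file selection. It is not Mahout's code. The default pattern "^part-.*", the full-match semantics, and the directory listing are all assumptions made for illustration.

```python
import re

# Assumed default pattern when the input is a directory (HDFS part files).
DEFAULT_PATTERN = r"^part-.*"
# Pattern passed with -fp in the command above.
CSV_PATTERN = r".*csv"


def select_files(filenames, pattern):
    """Return the filenames that the regex matches in full (assumed semantics)."""
    regex = re.compile(pattern)
    return [name for name in filenames if regex.fullmatch(name)]


# Hypothetical directory listing, for illustration only.
listing = ["ratings.csv", "part-00000", "_SUCCESS"]

print(select_files(listing, DEFAULT_PATTERN))  # ['part-00000']
print(select_files(listing, CSV_PATTERN))      # ['ratings.csv']
```

With a directory that holds only .csv files, the part-file pattern selects nothing, while ".*csv" selects every CSV file.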

score 0 · Answer 2 · answered May 28 '15 at 16:25

@eliasah

./mahout spark-itemsimilarity -D:spark.executor.extraClassPath=/home/kulwant/mahout/spark/target/mahout-spark_2.10-0.11.0-SNAPSHOT-dependency-reduced.jar --input /home/kulwant/data/ --output /home/kulwant/output --master spark://kulwant-VirtualBox:7077 --inDelim , --itemIDColumn 1 --rowIDColumn 0

This is the command I run from the terminal.
