
I tried converting a DataFrame (DF) to an RDD for a table containing 25 columns. I then came to know that Scala (I am on 2.11.8) limits tuples to a maximum of 22 elements.
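For reference, a quick spark-shell sketch of that limit (dummy values, not my table): a tuple literal with more than 22 elements is rejected at compile time, while a case class has been allowed to exceed 22 fields since Scala 2.11:

scala> val t = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
<console>:1: error: too many elements for tuple: 23, allowed: 22

scala> case class Wide(c1: Int, c2: Int, c3: Int, c4: Int, c5: Int, c6: Int, c7: Int, c8: Int, c9: Int, c10: Int, c11: Int, c12: Int, c13: Int, c14: Int, c15: Int, c16: Int, c17: Int, c18: Int, c19: Int, c20: Int, c21: Int, c22: Int, c23: Int)
defined class Wide

I first created the RDD from the Hive warehouse directory: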

val rdd = sc.textFile("/user/hive/warehouse/myDB.db/myTable/")
rdd: org.apache.spark.rdd.RDD[String] = /user/hive/warehouse/myDB.db/myTable/ MapPartitionsRDD[3] at textFile at <console>:24

Sample Data:

[2017-02-26, 100052-ACC, 100052, 3260, 1005, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]

Accessing each column:

val rdd3 = rdd.map(elements => {
  val el = elements.split(",")
  (el(0).substring(1, 11), el(1), el(2).toInt, el(3).toInt, el(4).toInt, el(5).toDouble, el(6).toDouble, el(7).toDouble, el(8).toDouble, el(9).toDouble, el(10).toDouble, el(11).toDouble, el(12).toDouble, el(13).toDouble, el(14).toDouble, el(15).toDouble, el(16).toDouble, el(17).toDouble, el(18).toDouble, el(19).toDouble, el(20).toDouble, el(21).toDouble, el(22).toDouble, el(23).toDouble, el(24).toDouble)
})

It throws an error:

<console>:1: error: too many elements for tuple: 26, allowed: 22

This is a known Scala limitation (https://issues.scala-lang.org/browse/SI-9572), so I created a case class to work around the problem.

case class HandleMaxTuple(col1:String, col2:String, col3: Int, col4: Int, col5: Int, col6: Double, col7: Double, col8: Double, col9: Double, col10: Double, col11: Double, col12: Double, col13: Double, col14: Double, col15: Double, col16: Double, col17: Double, col18: Double, col19: Double, col20: Double, col21: Double, col22: Double, col23: Double, col24: Double, col25:Double)

Thus the new RDD definition becomes:

val rdd3 = rdd.map(elements => {
  val el = elements.split(",")
  HandleMaxTuple(el(0).substring(1, 11), el(1), el(2).toInt, el(3).toInt, el(4).toInt, el(5).toDouble, el(6).toDouble, el(7).toDouble, el(8).toDouble, el(9).toDouble, el(10).toDouble, el(11).toDouble, el(12).toDouble, el(13).toDouble, el(14).toDouble, el(15).toDouble, el(16).toDouble, el(17).toDouble, el(18).toDouble, el(19).toDouble, el(20).toDouble, el(21).toDouble, el(22).toDouble, el(23).toDouble, el(24).toDouble)
})

However, when I try to read the contents of the mapped RDD:

rdd3.take(2).foreach(println)

it throws a java.lang.ArrayIndexOutOfBoundsException:

Error Stack:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1354)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
  ... 48 elided
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1

Any idea why it's happening? Any workarounds?


1 Answer


I tried to do exactly the same with your data using a case class, and I see two problems. First, look at the complete code:

package com.scalaspark.stackoverflow
import org.apache.spark.sql.SparkSession

object StackOverFlow {
  def main(args: Array[String]): Unit = {
    
    def parser(line: String): HandleMaxTuple = {
      val fields = line.split(",")
      // each numeric field starts with a space; strip the whitespace before the
      // conversions, since a leading space makes toInt throw NumberFormatException
      def clean(i: Int): String = fields(i).replaceAll("\\s", "")

      HandleMaxTuple(
        fields(0).substring(0, 10), // string indices start at 0, not 1
        fields(1),
        clean(2).toInt, clean(3).toInt, clean(4).toInt,
        clean(5).toDouble, clean(6).toDouble, clean(7).toDouble, clean(8).toDouble,
        clean(9).toDouble, clean(10).toDouble, clean(11).toDouble, clean(12).toDouble,
        clean(13).toDouble, clean(14).toDouble, clean(15).toDouble, clean(16).toDouble,
        clean(17).toDouble, clean(18).toDouble, clean(19).toDouble, clean(20).toDouble,
        clean(21).toDouble, clean(22).toDouble, clean(23).toDouble, clean(24).toDouble
      )
    }
    val spark = SparkSession
                .builder()
                .appName("number of tuples limit in RDD")
                .master("local[*]")
                .getOrCreate()
                
    val lines = spark.sparkContext.textFile("C:\\Users\\rajnish.kumar\\Desktop\\sampleData.txt", 1)
    lines.foreach(println)
    val parseddata = lines.map(parser)
    parseddata.foreach(println)
  }
  
  case class HandleMaxTuple(col1:String, col2:String, col3: Int, col4: Int, col5: Int, col6: Double, col7: Double, col8: Double, col9: Double, col10: Double, col11: Double, col12: Double, col13: Double, col14: Double, col15: Double, col16: Double, col17: Double, col18: Double, col19: Double, col20: Double, col21: Double, col22: Double, col23: Double, col24: Double, col25:Double)
}

The first problem is that for el(0) you are using substring(), which per the Java docs is:

String substring(int beginIndex, int endIndex)
Returns a new string that is a substring of this string. 

When I go with el(0).substring(1,11), I get java.lang.StringIndexOutOfBoundsException: String index out of range: 11.

So go with el(0).substring(0,10) instead (string indices start from zero, not from one).
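For example, on the bare 10-character date string in my test file (which has no leading bracket), a quick check in the REPL:

scala> "2017-02-26".substring(0, 10)
res0: String = 2017-02-26

scala> "2017-02-26".substring(1, 11)
java.lang.StringIndexOutOfBoundsException: String index out of range: 11

(If your raw field really does begin with a `[`, as your sample data shows, then substring(1,11) is the right slice for that input; the difference between our inputs is discussed in the comments below.)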

The second problem is that you are converting some of the fields with toInt and toDouble, but as I can see, each of them starts with a space; beware that the conversion can then fail with a NumberFormatException, just as it does in Java, for example:

scala> val i = "foo".toInt
java.lang.NumberFormatException: For input string: "foo"

For more info see https://alvinalexander.com/scala/how-cast-string-to-int-in-scala-string-int-conversion. To correct it, I used .replaceAll("\\s",""), which removes the whitespace in front of the numbers before converting them to Int and Double.
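The leading space alone is enough to break the integer columns; stripping it (with replaceAll, or simply trim) fixes the conversion. A quick sketch:

scala> " 3260".toInt
java.lang.NumberFormatException: For input string: " 3260"

scala> " 3260".replaceAll("\\s", "").toInt
res1: Int = 3260

scala> " 3260".trim.toInt
res2: Int = 3260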

When you run the above sample, you get output like this:

HandleMaxTuple(2017-02-26, 100052-ACC,100052,3260,1005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0)
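As for the original java.lang.ArrayIndexOutOfBoundsException: 1, that usually means split(",") produced fewer fields than expected for some line; an index of 1 points at a line with no comma at all, which can happen if the underlying Hive files use a different field delimiter (for example Hive's default '\u0001'). A small diagnostic sketch, assuming the same lines RDD and parser as above (the comma is an assumption; substitute your table's real separator):

// how many fields does each raw line actually split into?
lines.map(_.split(",").length).countByValue().foreach(println)

// skip malformed lines instead of failing the whole job
val safeData = lines.filter(_.split(",").length == 25).map(parser)
safeData.take(2).foreach(println)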
  • Thanks for the details. Regarding the first problem, the reason I use `substring` is that the RDD that gets created gives the first value as `[2017-02-26`; notice the `[` at the beginning. So to select only the date value I have to use `substring`. The second problem could be an issue, yeah, but because the numbers are generated implicitly by the RDD itself, I don't have control over them. I'll try to use a replace before calling the function and update here. – knowone Mar 27 '18 at 19:59
  • Hi, I checked the data in multiple ways, but the data seems to be proper, and even the conversions in the RDD above are done properly. Here are some excerpts: – knowone Mar 28 '18 at 05:41
  • `rdd.take(1) res45: Array[String] = Array(201735?100102-DS?100102?D3?1198?1005?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?-34.37?-45.78?-50.4?-7?-7?-7?-34.37?0?0?0?0?0?0?0?-34.37?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?0?2017-02-26)` – knowone Mar 28 '18 at 05:42
  • `rdd.take(1).foreach(println)` `201735100102-DS100102DS119810050000000000000000000000000000000000-34.37-45.78-50.4-7-7-7-34.370000000-34.370000000000000000002017-02-26` – knowone Mar 28 '18 at 05:43
  • @knowone so did it solve your problem? Or, if it's still the same issue, can you provide the Spark and Scala versions so that I can work with them and see if I can help you? – Rajnish Kumar Mar 28 '18 at 07:17
  • It didn't solve my problem, but I got rid of it. I had to map the actual table structure against the conversion I was doing above in the RDD and found out that it wasn't mapped properly. There's some other problem even after resolving that, but it isn't actually related to the problem statement of this question, so I can't put it here. – knowone Mar 28 '18 at 07:30