2

(EDIT: Looking at where this question started, it really ended up in a much better place. It wound up being a nice resource on the limits of RDD sizes in Spark when set through SparkContext.parallelize() vs. the actual size limits of RDDs. Also uncovered some arguments to parallelize() not found in user docs. Look especially at zero323's comments and his accepted answer.)

Nothing new under the sun, but I can't find this question already asked ... The question is about how wrong/inadvisable/improper it might be to run a cast inside a large for loop in Java.

I want to run a for loop to initialize an ArrayList before passing it to a SparkContext.parallelize() method. I have found that passing an uninitialized list to Spark can cause an empty-collection error.

I have seen many posts about how floats and doubles are bad ideas as loop counters; I get that. It just seems like this is a bad idea too? Like there must be a better way?

numListLen will be 10^6 * 10^3 for now, maybe as large as 10^12 at some point.

    List<Double> numList = new ArrayList<Double>(numListLen);
    for (long i = 0; i < numListLen; i++) {
        numList.add((double) i);
    }

I would love to hear where specifically this code falls down and can be improved. I'm a junior-level CS student so I haven't seen all the angles yet haha. Here's a CMU page seemingly approving this approach in C using implicit casting.


Just for background, numList is going to be passed to Spark to tell it how many times to run a simulation and create an RDD with the results, like this:

    JavaRDD dataSet = jsc.parallelize(numList, SLICES_AKA_PARTITIONS);

    // the function will be applied to each member of dataSet
    Double count = dataSet.map(new Function<Double, Double>() {...

(Actually I'd love to run this ArrayList creation through Spark, but it doesn't seem to take enough time to warrant that: about 5 seconds on my i5 dual-core, but if boosted to 10^12 then ... longer )
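
If it helps, here's roughly the full shape of what I'm doing, as a runnable sketch. The map body is just a placeholder, the reduce is only my assumption of how the per-run results get combined, and the class name and constant values here are placeholders shrunk so it runs quickly:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.Function2;

    public class ParallelizeSketch {
        public static void main(String[] args) {
            // small values so the sketch runs; the real numListLen is much larger
            final long numListLen = 1_000_000L;
            final int SLICES_AKA_PARTITIONS = 8;

            JavaSparkContext jsc = new JavaSparkContext(
                    new SparkConf().setAppName("sketch").setMaster("local[*]"));

            List<Double> numList = new ArrayList<Double>();
            for (long i = 0; i < numListLen; i++) {
                numList.add((double) i);       // the cast/boxing this question is about
            }

            JavaRDD<Double> dataSet = jsc.parallelize(numList, SLICES_AKA_PARTITIONS);

            Double count = dataSet
                    .map(new Function<Double, Double>() {
                        public Double call(Double x) {
                            return x;          // placeholder for one simulation run
                        }
                    })
                    .reduce(new Function2<Double, Double, Double>() {
                        public Double call(Double a, Double b) {
                            return a + b;      // combine per-run results
                        }
                    });

            System.out.println("count = " + count);
            jsc.stop();
        }
    }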

JimLohse
  • _I would love to hear where specifically this code falls down_ - for starters you cannot allocate `ArrayList` larger than `Integer.MAX_VALUE`. – zero323 Dec 20 '15 at 22:30
  • Oh yeah it rolls over and tries to assign a negative length to the ArrayList ... very useful feedback, I knew if I brought Spark in the picture I might get a response from you. Thanks and if you want to make this an answer, and address the core question of whether there's a more efficient way to cast, I'd accept the answer. – JimLohse Dec 20 '15 at 23:17
  • And I suppose, in that vein, the max "length" of a RDD is the max length of a scala collection? Per Spark Java docs for parallelize() aka makeRDD() -- public RDD makeRDD(scala.collection.Seq seq... I think that's a separate question gonna search for that and ask if I can't find it: max length of spark RDD – JimLohse Dec 20 '15 at 23:30
  • Not really. `parallelize` is simply not designed to initialize large RDDs. Obviously you cannot collect an RDD larger than MAX_INT and you can hit other limitations, but beyond that it should work just fine as long as you have enough resources. Regarding initialization, I would simply initialize with any (empty) sequence and then initialize actual data using `mapPartitions` / `mapPartitionsWithIndex`. – zero323 Dec 20 '15 at 23:43
  • You could even create an infinite sequence from there :) I am pretty sure I've shown somewhere that it is possible using `flatMap`. – zero323 Dec 20 '15 at 23:46
  • Your code also implicitly boxes the `double` to `Double` which creates an object on every iteration. – Radiodef Dec 21 '15 at 00:11
  • Very helpful @Radiodef thanks! Also I was told that Scala enables a Range declaration like 1L to 200000000L inside parallelize() and that Java 8 should allow something similar. Still need to test that out so... – JimLohse Dec 21 '15 at 20:06
  • I've just realized there is a range method on a SparkContext. You should be able to use it directly. I've edited the answer. – zero323 Dec 28 '15 at 10:35

3 Answers

2

The problem those posts describe is using a double or float as the loop counter. In your case the loop counter is a long, so it does not suffer from the same problems.

One problem with a double or float as a loop counter is that floating-point precision leaves gaps in the series of numbers that can be represented. It is possible to reach a point within the valid range of a floating-point number where adding one falls below the precision of the value being represented (for example, when the next value would require 16 significant digits but the format only carries about 15). If your loop passed through such a point in normal execution, the counter would stop incrementing and the loop would spin forever.

The other problem with doubles as loop counters is comparing two floating-point values. Rounding means that to compare them reliably you need to check whether they fall within a small range of each other rather than testing exact equality. While you might consider 1.0000000 and 0.999999999 to be equal, your computer would not. So rounding can also make you miss the loop's termination condition.

Neither of these problems occurs with your long as the loop counter. So enjoy having done it right.
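
For example, here is a small runnable demonstration of both problems (plain Java, nothing Spark-specific; the constants are only illustrative):

    public class DoubleLoopHazards {
        public static void main(String[] args) {
            // 1. Equality comparisons: ten additions of 0.1 do not sum to exactly 1.0,
            //    so a loop that terminates on "sum == 1.0" would never stop.
            double sum = 0.0;
            for (int i = 0; i < 10; i++) {
                sum += 0.1;
            }
            System.out.println(sum);           // 0.9999999999999999
            System.out.println(sum == 1.0);    // false

            // 2. Precision gaps: at 2^53 adding 1.0 to a double no longer changes it,
            //    so a double counter that reached this point would stop advancing.
            double big = 9_007_199_254_740_992.0;   // 2^53
            System.out.println(big + 1.0 == big);   // true
        }
    }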

2

Although I don't recommend the use of floating-point values (either single or double precision) as for-loop counters, in your case, where the step is not a decimal number (you use 1 as a step), everything depends on your largest expected number vs. the 52-bit fraction field of the double representation.

Doubles represent every integer up to 2^53 exactly (between 2^52 and 2^53 the representable values are spaced exactly 1 apart), but above 2^53 the spacing grows beyond 1, so you cannot always achieve integer precision.

In practice, and because your loop step is 1, you would not experience any problems up to 9,007,199,254,740,992 (2^53) if you used a double as the counter, and you would thus avoid the cast (you can't avoid the boxing from double to Double, though).

Perform a simple increment-test; you will see that 9,007,199,254,740,995 is the first false positive!

FYI: for float numbers, you are safe incrementing up to 2^24 = 16,777,216 (the article you linked uses the number 100000001.0f > 16777216 to demonstrate the problem).
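
As a rough sketch of such an increment test (done with float so it finishes almost instantly; the double version is the same idea but would have to scan up to around 2^53):

    public class FloatIncrementTest {
        public static void main(String[] args) {
            float f = 0f;   // the floating-point "counter"
            long  i = 0;    // the exact value it is supposed to track
            while ((double) f == (double) i) {
                f += 1f;
                i += 1;
            }
            // Prints 16777217 (2^24 + 1): the first whole number a float cannot
            // represent, so the float counter silently stops matching the true count.
            System.out.println("float counter diverges at i = " + i);
        }
    }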

Kostas Kryptos
2

davidstenberg and Konstantinos Chalkias already covered problems related to using Doubles as counters, and Radiodef pointed out an issue with creating objects in the loop, but at the end of the day you simply cannot allocate an ArrayList larger than Integer.MAX_VALUE. On top of that, even with 2^31 elements, this is a pretty large object, and serialization and network traffic can add a substantial overhead to your job.

There are a few ways you can handle this:

  • using SparkContext.range method:

    range(start: Long, end: Long, 
      step: Long = 1, numSlices: Int = defaultParallelism)
    
  • initializing the RDD using a range object. In PySpark you can use range (xrange in Python 2), in Scala a Range:

    val rdd = sc.parallelize(1L to Long.MaxValue) 
    

    It requires constant memory on the driver and constant network traffic per executor (all you have to transfer is just the beginning and the end).

    In Java 8, LongStream.range could work the same way, but it looks like JavaSparkContext doesn't provide the required constructors yet. If you're brave enough to deal with all the singletons and implicits you can use Scala Range directly, and if not you can simply write a Java-friendly wrapper.

  • initializing the RDD using the emptyRDD method / a small number of seeds and populating it using mapPartitions(WithIndex) / flatMap (see the Java sketch after this list). See for example Creating array per Executor in Spark and combine into RDD

    With a little bit of creativity you can actually generate an infinite number of elements this way (Spark FlatMap function for huge lists).

  • given your particular use case, you should also take a look at mllib.random.RandomRDDs. It provides a number of useful generators from different distributions.
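
For reference, here is a rough Java sketch of the seed-and-expand idea from the mapPartitions / flatMap bullet above. The class and parameter names (SeededRange, numPartitions, perPartition) and their values are placeholders, and it targets the Spark 1.x Java API, where FlatMapFunction.call returns an Iterable:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;

    public class SeededRange {
        // Builds a large RDD of doubles without materializing it on the driver:
        // only the small list of partition seeds is shipped out.
        public static JavaRDD<Double> seededRange(JavaSparkContext jsc,
                                                  final int numPartitions,
                                                  final long perPartition) {
            List<Integer> seeds = new ArrayList<Integer>();
            for (int p = 0; p < numPartitions; p++) {
                seeds.add(p);
            }

            return jsc.parallelize(seeds, numPartitions)
                    .flatMap(new FlatMapFunction<Integer, Double>() {
                        public Iterable<Double> call(Integer partition) {
                            // generated on the executor, not on the driver
                            List<Double> chunk = new ArrayList<Double>();
                            long start = partition * perPartition;
                            for (long j = 0; j < perPartition; j++) {
                                chunk.add((double) (start + j));
                            }
                            return chunk;
                        }
                    });
        }
    }

Each executor still builds its own chunk in memory, so perPartition should be small enough that a single partition's list fits comfortably on an executor.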

zero323