(EDIT: Given where this question started, it ended up in a much better place: it became a nice resource on the limits of RDD sizes in Spark when set through SparkContext.parallelize() versus the actual size limits of RDDs, and it also uncovered some arguments to parallelize() not found in the user docs. See especially zero323's comments and his accepted answer.)
Nothing new under the sun, but I can't find this question already asked. The question is how wrong/inadvisable/improper it might be to run a cast inside a large for loop in Java.
I want to run a for loop to initialize an ArrayList before passing it to a SparkContext.parallelize() method. I have found that passing an uninitialized list to Spark can cause an empty collection error (a sketch of what I mean is just below).
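For reference, this is a minimal sketch of the failure I mean, assuming a local JavaSparkContext named jsc. The parallelize call itself accepts the empty list; it's the reduce that blows up:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

List<Double> empty = new ArrayList<Double>();
JavaRDD<Double> rdd = jsc.parallelize(empty); // fine: creates an empty RDD
rdd.reduce(new Function2<Double, Double, Double>() { // throws here
    public Double call(Double a, Double b) { return a + b; }
});
// java.lang.UnsupportedOperationException: empty collection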
I have seen many posts about how floats and doubles are bad ideas as counters, and I get that, but it seems like this is a bad idea too? Like there must be a better way?
numListLen will be 10^6 * 10^3 for now, maybe as large as 10^12 at some point (which, I realize, is more than a single ArrayList can hold, since it is capped at Integer.MAX_VALUE elements).
import java.util.ArrayList;
import java.util.List;

int numListLen = 1_000_000_000; // 10^6 * 10^3; the ArrayList constructor takes an int capacity
List<Double> numList = new ArrayList<Double>(numListLen);
for (long i = 0; i < numListLen; i++) {
    numList.add((double) i); // the cast in question: long widened to double
}
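For comparison, here is the cast-free alternative I was imagining, assuming Java 8 streams are available. The widening from long to double happens inside the stream pipeline instead of in an explicit cast; whether this is actually any better is part of my question:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

List<Double> numList = LongStream.range(0, numListLen)
        .asDoubleStream() // widens each long to double, no explicit cast
        .boxed()          // DoubleStream -> Stream<Double>
        .collect(Collectors.toList());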
I would love to hear where, specifically, this code falls down and how it could be improved. I'm a junior-level CS student, so I haven't seen all the angles yet, haha. Here's a CMU page that seems to approve of this approach in C using implicit casting.
Just for background: numList is going to be passed to Spark to tell it how many times to run a simulation and to create an RDD with the results, like this:
JavaRDD<Double> dataSet = jsc.parallelize(numList, SLICES_AKA_PARTITIONS);
// the simulation function will be applied to each member of dataSet
Double count = dataSet.map(new Function<Double, Double>() {...
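To make the shape of that concrete, here is a sketch of the full pattern I'm aiming for. runOneSimulation is a hypothetical placeholder for my actual simulation, and the sum-reduce at the end is my assumption about how the per-element results get collapsed:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

JavaRDD<Double> dataSet = jsc.parallelize(numList, SLICES_AKA_PARTITIONS);
Double count = dataSet.map(new Function<Double, Double>() {
    public Double call(Double seed) {
        return runOneSimulation(seed); // hypothetical simulation call
    }
}).reduce(new Function2<Double, Double, Double>() {
    public Double call(Double a, Double b) {
        return a + b; // assumption: results are summed
    }
});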
(Actually, I'd love to run this ArrayList creation through Spark itself, but it doesn't seem to take enough time to warrant that: about 5 seconds on my dual-core i5. If boosted to 10^12, though ... longer. A sketch of that idea is below.)
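In case it helps, here is roughly what I mean by that. It's a sketch only: it assumes the Spark 1.x FlatMapFunction signature (which returns an Iterable), that numListLen divides evenly by SLICES_AKA_PARTITIONS, and that both are effectively final so the anonymous class can capture them:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// one seed element per partition; each partition generates its own slice
List<Integer> parts = new ArrayList<Integer>();
for (int p = 0; p < SLICES_AKA_PARTITIONS; p++) {
    parts.add(p);
}
JavaRDD<Double> dataSet = jsc.parallelize(parts, SLICES_AKA_PARTITIONS)
    .flatMap(new FlatMapFunction<Integer, Double>() {
        public Iterable<Double> call(Integer p) {
            long chunk = numListLen / SLICES_AKA_PARTITIONS; // even split assumed
            List<Double> slice = new ArrayList<Double>();
            for (long i = p * chunk; i < (p + 1) * chunk; i++) {
                slice.add((double) i);
            }
            return slice;
        }
    });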