I'm trying to randomise the order of elements in an RDD. My current approach is to zip the elements with an RDD of shuffled integers, then later join on those integers.
However, pyspark falls over with only 100,000,000 integers: the driver process is killed while building the shuffled index list, before Spark is even involved. I'm using the code below.
My question is: is there a better way either to zip with a random index or to shuffle the RDD some other way?
I've tried sorting by a random key, which works, but is slow.
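For reference, the sort-based version I tried looks roughly like this (a sketch; rdd stands in for the RDD being shuffled):

import random

# Attach a random sort key to every element and sort on it. This shuffles
# the RDD but triggers a full distributed sort, hence the slowness.
shuffled = rdd.sortBy(lambda _: random.random())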
import random

def random_indices(n):
    """
    Return a list of the integers 0..n-1 in random order.
    """
    indices = range(n)        # on Python 2 this materialises a list of n ints
    random.shuffle(indices)   # in-place shuffle of the whole list
    return indices
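The indices are then zipped onto the RDD to serve as the join key, along these lines (a sketch of the approach rather than my exact code):

n = rdd.count()
index_rdd = sc.parallelize(random_indices(n), rdd.getNumPartitions())
# Caveat: zip() requires both RDDs to have the same number of partitions
# *and* the same number of elements per partition, which parallelize()
# does not guarantee for an arbitrary rdd.
keyed = index_rdd.zip(rdd)   # (shuffled_index, element) pairs
# the elements are later joined on the integer key, e.g. keyed.join(other)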
The following happens in pyspark:
Using Python version 2.7.3 (default, Jun 22 2015 19:33:41)
SparkContext available as sc.
>>> import clean
>>> clean.sc = sc
>>> clean.random_indices(100000000)
Killed
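My guess is that this is the driver running out of memory: on CPython 2, range(100000000) materialises a list of 100 million int objects (roughly 24 bytes each, plus the list's pointers, so several gigabytes) before the shuffle even starts. One alternative I've been wondering about, which never materialises the index list on the driver, is to scatter elements across partitions with a random key and then shuffle each partition locally. A sketch (untested at scale; rdd is again the RDD being shuffled):

import random

def shuffle_partition(iterator):
    # materialise one partition in executor memory and shuffle it locally
    items = list(iterator)
    random.shuffle(items)
    return items

shuffled = (rdd.keyBy(lambda _: random.randint(0, 2 ** 31 - 1))  # random placement key
               .partitionBy(rdd.getNumPartitions())              # scatter across partitions
               .values()
               .mapPartitions(shuffle_partition))                # local shuffle per partition

This trades the global sort for one shuffle plus in-memory shuffles of individual partitions, though the result is only approximately a uniform random permutation.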