8

I'm trying to randomise the order of elements in an RDD. My current approach is to zip the elements with an RDD of shuffled integers, then later join by those integers.

However, pyspark falls over with only 100000000 integers. I'm using the code below.

My question is: is there a better way to either zip with the random index or otherwise shuffle?

I've tried sorting by a random key, which works, but is slow.

import random

def random_indices(n):
    """
    return an iterable of random indices in range(0,n)
    """
    indices = range(n)
    random.shuffle(indices)
    return indices
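
The way I intend to use this is roughly the following (just a sketch; `rdd` stands in for my data):

n = rdd.count()
# an RDD of shuffled integers laid out like the data; building the full
# shuffled list on the driver is where things currently die
shuffled_ints = sc.parallelize(random_indices(n), rdd.getNumPartitions())
# pair each element with a random index (zip also assumes matching
# partition sizes, which parallelize does not guarantee)
keyed = rdd.zip(shuffled_ints)
# ... later: key by the random index so other RDDs can be joined on it
by_random_index = keyed.map(lambda kv: (kv[1], kv[0]))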

The following happens in pyspark:

Using Python version 2.7.3 (default, Jun 22 2015 19:33:41)
SparkContext available as sc.
>>> import clean
>>> clean.sc = sc
>>> clean.random_indices(100000000)
Killed
Marcin

2 Answers

5

One possible approach is to add random keys using mapPartitions:

import os
import numpy as np

swap = lambda x: (x[1], x[0])

def add_random_key(it):
    # make sure we get a proper random seed
    seed = int(os.urandom(4).encode('hex'), 16) 
    # create separate generator
    rs = np.random.RandomState(seed)
    # Could be randint if you prefer integers
    return ((rs.rand(), swap(x)) for x in it)

rdd_with_keys = (rdd
  # It will be used as the final key. If you can't accept gaps,
  # use zipWithIndex, but this should be cheaper.
  .zipWithUniqueId()
  .mapPartitions(add_random_key, preservesPartitioning=True))

Next you can repartition, sort each partition and extract values:

n = rdd.getNumPartitions()
(rdd_with_keys
    # partition by random key to put data on random partition 
    .partitionBy(n)
    # Sort partition by random value to ensure random order on partition
    .mapPartitions(sorted, preservesPartitioning=True)
    # Extract (unique_id, value) pairs
    .values())

If sorting per partition is still too slow, it could be replaced with a Fisher–Yates shuffle.
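
For example, something along these lines (a sketch using the standard library's `random.shuffle`, which is a Fisher–Yates shuffle; the random keys are still needed so that `partitionBy` spreads records across partitions):

import os
import random

def shuffle_partition(it):
    # seed a dedicated generator per partition, as in add_random_key
    seed = int(os.urandom(4).encode('hex'), 16)
    rng = random.Random(seed)
    # materialise the partition and shuffle it in place
    items = list(it)
    rng.shuffle(items)
    return items

(rdd_with_keys
    .partitionBy(n)
    # shuffle within each partition instead of sorting by the random key
    .mapPartitions(shuffle_partition, preservesPartitioning=True)
    # Extract (unique_id, value) pairs
    .values())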

If you simply need random data, you can use mllib.RandomRDDs:

from pyspark.mllib.random import RandomRDDs

RandomRDDs.uniformRDD(sc, n)

Theoretically it could be zipped with the input rdd, but it would require matching the number of elements per partition.
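
Something like the following (hypothetical sketch; zip only succeeds when both RDDs have the same number of partitions and the same number of elements in each partition, which is not guaranteed here):

random_keys = RandomRDDs.uniformRDD(sc, rdd.count(), numPartitions=rdd.getNumPartitions())
# may fail if partition sizes don't line up exactly
keyed = random_keys.zip(rdd)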

zero323
  • Thanks, this is useful. I actually need the keys to be unique. – Marcin Aug 20 '15 at 14:00
  • Do you have any other requirements here? Because if not, you can simply `zipWithIndex` or `zipWithUniqueId` afterwards. It adds another transformation but is not extremely expensive. – zero323 Aug 20 '15 at 14:15
  • I need the keys to be both randomly ordered and unique. I can sort by a random key, but that proves to be quite slow. – Marcin Aug 20 '15 at 14:20
  • If so you can add index or id before shuffling. I've updated the answer. – zero323 Aug 20 '15 at 14:38
  • In testing now, this seems like it takes about as long as sorting (I'm using shuffling); however, I think I can just not shuffle within the partitions, so thank you very much for this. It has definitely helped me learn more about using spark as well as speeding up my code. – Marcin Aug 20 '15 at 18:37
  • Well, truth be told, it's not that much different from sorting. While there is no need for a histogram, you have to shuffle pretty much the same amount of data. I doubt there is any way to avoid it if a true random order is required. – zero323 Aug 20 '15 at 18:59
  • Actually, I spoke too soon - shuffling was 26 mins vs about 40 mins for sorting. – Marcin Aug 20 '15 at 19:13
-1

pyspark worked!

from random import randrange
data_rnd = data.sortBy(lambda x: randrange(1000000))

Colin Wang