
I want to feed data coming from Spark clusters to train a deep network. I do not have GPUs on the nodes, so distributed TensorFlow or packages like elephas are not an option.

I have come up with the following generator, which does the job. It just retrieves the next batch from Spark. To handle batches I add an extra index column (simply an incremental id) and filter on it on each call for the next batch.


import numpy as np
import pyspark.sql.functions as sf
from pyspark.sql.window import Window
from tensorflow import keras as tfk


class SparkBatchGenerator(tfk.utils.Sequence):
    def __init__(self, spark_df, batch_size, sample_count=None, feature_col='features', label_col='labels'):
        # Assign a 0-based incremental id to every row so batches can be selected by index range.
        w = Window().partitionBy(sf.lit('a')).orderBy(sf.lit('a'))
        df = spark_df.withColumn('index', sf.row_number().over(w) - 1).sort('index')
        self.X = df.select([feature_col, 'index'])
        self.y = df.select([label_col, 'index'])

        self.data_count = sample_count if sample_count else spark_df.count()
        self.feature_col = feature_col
        self.label_col = label_col
        self.batch_size = batch_size

    def __len__(self):
        return np.ceil(self.data_count / self.batch_size).astype(int)

    def __getitem__(self, idx):
        # Select only the rows whose index falls inside the requested batch.
        start, end = idx * self.batch_size, (idx + 1) * self.batch_size
        batch_x = (
            self.X.filter(f'index >= {start} and index < {end}')
                  .toPandas()[self.feature_col]
                  .apply(lambda x: x.toArray()).tolist()   # Spark ML Vector -> numpy array
        )
        batch_y = (
            self.y.filter(f'index >= {start} and index < {end}')
                  .toPandas()[self.label_col].tolist()
        )

        return np.array(batch_x), np.array(batch_y)
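
For context, a `Sequence` like this would typically be passed straight to Keras training. A minimal sketch, assuming a compiled tf.keras model named `model` and that `feature_df` is the Spark DataFrame (both placeholders, not shown above):

# Hypothetical usage: `model` is a compiled tf.keras model.
train_gen = SparkBatchGenerator(feature_df, batch_size=128)
model.fit(train_gen, epochs=5)  # or model.fit_generator(train_gen, ...) on older TF 1.x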

This works, but of course it is slow, especially when batch_size is small. I was just wondering if anyone has a better solution.

Hamed
  • why use ```spark``` for this? not ideal for DL ... maybe just stick to ```pandas``` on a single machine running ```tensorflow``` server – thePurplePython Sep 11 '19 at 18:45
  • My data is coming from the Spark platform, it is out of my hands. It also won't fit in the memory. – Hamed Sep 12 '19 at 08:24
  • is it in ```s3```, ```azure``` or ```hdfs``` storage? i saw ```toPandas()``` so figured it would be better for you to leverage ```python``` and just perform the mini-batches that way – thePurplePython Sep 12 '19 at 13:11
  • What is the function of `partitionBy(sf.lit("a")).orderBy(sf.lit("a"))`? – Chiel Mar 03 '21 at 10:50
  • To assign a unique number to each row in the next line. – Hamed Mar 04 '21 at 12:14
  • @Hamed: Does `partitionBy(sf.lit("a")).orderBy(sf.lit("a"))` shuffle the data? Or would a `orderBy(sf.rand(0))` be necessary to ensure a random order of the data? – Bill DeRose May 27 '21 at 20:59

2 Answers


I used tf.data.Dataset to handle this. I can buffer the data coming from Spark and then leave the job of batch creation to the TensorFlow Dataset API. It is now much faster:

import tensorflow as tf
import pyspark.sql.functions as sf
from pyspark.sql.window import Window


class MyGenerator(object):
    def __init__(
        self, spark_df, buffer_size, feature_col="features", label_col="labels"
    ):
        # Assign a 0-based incremental id so buffers can be selected by index range.
        w = Window().partitionBy(sf.lit("a")).orderBy(sf.lit("a"))
        self.df = (
            spark_df.withColumn("index", sf.row_number().over(w) - 1)
            .sort("index")
            .select([feature_col, "index", label_col])
        )

        self.feature_col = feature_col
        self.label_col = label_col
        self.buffer_size = buffer_size

    def generate_data(self):
        idx = 0
        buffer_counter = 1  # next chunk to fetch from Spark; chunk 0 is loaded below
        buffer = self.df.filter(
            f"index >= 0 and index < {self.buffer_size}"
        ).toPandas()
        while len(buffer) > 0:
            if idx < len(buffer):
                # Scale and reshape; 28x28 because the features here are MNIST-style images.
                X = buffer.iloc[idx][self.feature_col].toArray() / 255.0
                y = buffer.iloc[idx][self.label_col]

                idx += 1

                yield X.reshape((28, 28)), y
            else:
                # Current buffer exhausted: pull the next chunk from Spark.
                buffer = self.df.filter(
                    f"index >= {buffer_counter * self.buffer_size} "
                    f"and index < {(buffer_counter + 1) * self.buffer_size}"
                ).toPandas()
                idx = 0
                buffer_counter += 1


batch_size = 128
buffer_size = 4 * 1024

my_gen = MyGenerator(feature_df, buffer_size)
dataset = tf.data.Dataset.from_generator(
    my_gen.generate_data, output_types=(tf.float32, tf.int32)
)
dataset = dataset.batch(batch_size, drop_remainder=True).prefetch(
    tf.data.experimental.AUTOTUNE
)
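
From there the dataset plugs straight into Keras training. A minimal sketch, assuming a compiled tf.keras model named `model` and the total row count in `num_samples` (both placeholders, not defined above):

# Hypothetical wiring: `model` and `num_samples` are assumed to exist.
steps = num_samples // batch_size          # full batches only, since drop_remainder=True
model.fit(dataset, epochs=5, steps_per_epoch=steps)
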
Hamed

This is an old thread and the solution has a lot of boilerplate code. If someone wants a framework to do the heavy lifting, try TensorFlowOnSpark.
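
For reference, the driver-side wiring in TensorFlowOnSpark's Spark input mode looks roughly like the sketch below; `map_fun`, `args`, `num_executors` and `images_labels_rdd` are placeholders rather than anything from this thread, so check the project's examples for the exact API:

from tensorflowonspark import TFCluster

# map_fun(args, ctx) is the per-executor training function; inside it the
# Spark-fed batches are read via TFNode.DataFeed.
cluster = TFCluster.run(
    sc,                  # existing SparkContext
    map_fun,             # training function (placeholder)
    args,                # hyperparameters handed to map_fun
    num_executors,       # Spark executors acting as TF workers
    num_ps=1,            # parameter servers
    tensorboard=False,
    input_mode=TFCluster.InputMode.SPARK,
)
cluster.train(images_labels_rdd, num_epochs=1)  # stream Spark partitions to the workers
cluster.shutdown()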

Other alternate solutions:

Azhar Khan