I want to feed data coming from Spark clusters into a deep network for training. I do not have GPUs on the nodes, so distributed TensorFlow and packages like elephas are not an option.
I have come up with the following generator, which does the job: it simply retrieves the next batch from Spark. To handle batching, I add an extra index column (a simple incremental id) and filter on it on each call for the next batch.
import numpy as np
import pyspark.sql.functions as sf
import tensorflow.keras as tfk
from pyspark.sql import Window


class SparkBatchGenerator(tfk.utils.Sequence):
    def __init__(self, spark_df, batch_size, sample_count=None,
                 feature_col='features', label_col='labels'):
        # A constant partition/order key yields a single global row numbering
        # (at the cost of pulling every row through one window partition).
        w = Window.partitionBy(sf.lit('a')).orderBy(sf.lit('a'))
        # row_number() is 1-based; shift to 0-based so the [start, end)
        # ranges in __getitem__ cover every row exactly once.
        df = spark_df.withColumn('index', sf.row_number().over(w) - 1).sort('index')
        self.X = df.select([feature_col, 'index'])
        self.y = df.select([label_col, 'index'])
        self.data_count = sample_count if sample_count else spark_df.count()
        self.feature_col = feature_col
        self.label_col = label_col
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(self.data_count / self.batch_size))

    def __getitem__(self, idx):
        start, end = idx * self.batch_size, (idx + 1) * self.batch_size
        # Each call runs a Spark job that filters the whole DataFrame
        # down to this batch's index range.
        batch_x = (
            self.X.filter(f'index >= {start} and index < {end}')
            .toPandas()[self.feature_col]
            .apply(lambda x: x.toArray())  # Spark ML Vector -> numpy array
            .tolist()
        )
        batch_y = (
            self.y.filter(f'index >= {start} and index < {end}')
            .toPandas()[self.label_col]
            .tolist()
        )
        return np.array(batch_x), np.array(batch_y)
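For context, this is roughly how I use it (train_df and model are placeholders, not part of the class):

# train_df is a Spark DataFrame with a Spark ML 'features' vector column
# and a numeric 'labels' column; model is a compiled Keras model.
train_gen = SparkBatchGenerator(train_df, batch_size=128)
model.fit(train_gen, epochs=5)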
This works, but it is of course slow, especially when batch_size is small. I was just wondering if anyone has a better solution.
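For reference, one direction I have been considering (not sure it is the right approach) is to stream rows to the driver once with toLocalIterator and wrap that in tf.data, so each batch does not trigger a separate filter job. This is only a sketch; num_features, batch_size, and spark_df are placeholders for my actual values:

import tensorflow as tf

def row_iter(df, feature_col='features', label_col='labels'):
    # toLocalIterator() streams one partition at a time to the driver,
    # so the whole DataFrame never has to fit in memory at once.
    for row in df.select(feature_col, label_col).toLocalIterator():
        yield row[feature_col].toArray(), row[label_col]

dataset = (
    tf.data.Dataset.from_generator(
        lambda: row_iter(spark_df),
        output_signature=(
            tf.TensorSpec(shape=(num_features,), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.float32),
        ),
    )
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)
# model.fit(dataset, epochs=...)

Even with that, I would be interested in other approaches.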