I have a huge dataframe containing many different item_id values and their related data.
I need to process the group of rows for each item_id separately and in parallel.
I tried to repartition the dataframe by item_id using the code below,
but it seems the data is still being processed as a whole rather than in chunks:

    import pandas as pd

    data = sqlContext.read.csv(path='/user/data', header=True)
    columns = data.columns

    # mapPartitions must return an iterable, so the DataFrame is wrapped in a list
    result = data.repartition('ITEM_ID') \
        .rdd \
        .mapPartitions(lambda rows: [pd.DataFrame(list(rows), columns=columns)]) \
        .mapPartitions(scan_item_best_model) \
        .collect()

Also, is repartition the correct approach, or is there something I am doing wrong?
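
To make "chunks" concrete: as far as I understand, repartition('ITEM_ID') hash-partitions the rows, so all rows of one ITEM_ID land in the same partition, but a single partition can still hold several ITEM_IDs. Below is a minimal sketch of the per-item split I am after, written in plain Python so it runs without a cluster. The helper name split_by_item and the sample rows are hypothetical, and it assumes each row can be indexed by column name (Spark Row objects allow row['ITEM_ID']).

```python
from collections import defaultdict


def split_by_item(rows, key="ITEM_ID"):
    """Group one partition's rows by item id and yield (item_id, rows) pairs.

    Hypothetical helper: `rows` is the iterator that mapPartitions hands over.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # One chunk per item id, even if several ids landed in this partition.
    for item_id, item_rows in groups.items():
        yield item_id, item_rows


# Stand-in for a single partition's iterator (made-up sample data).
partition = iter([
    {"ITEM_ID": "a", "value": 1},
    {"ITEM_ID": "b", "value": 2},
    {"ITEM_ID": "a", "value": 3},
])
chunks = dict(split_by_item(partition))
```

In Spark this could presumably be plugged in as data.repartition('ITEM_ID').rdd.mapPartitions(split_by_item), so that every element of the resulting RDD holds the rows of exactly one item.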