Each mapper will read its file split, do something with it (for example, a pre-aggregation like distinct) and emit `(dt, uid)` pairs grouped and sorted by `dt`, so different `dt` values will be put in different files, which will be consumed by the reducers in the next step.
Reducers will read the files prepared by the mappers, and all records with the same `dt` will be read by the same reducer, because the records were distributed by `dt` and sorted on the mapper side.
Each reducer will merge the partial results (files from mappers) and do the count aggregation. If several `dt` values end up in the same file, the records are still sorted, which reduces the amount of work the reducer has to do.
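
For concreteness, here is a minimal sketch of such a pipeline in HiveQL. The table `events(dt, uid)` and the reducer script `count_per_dt.py` are hypothetical, used only for illustration; they are not from the original question:

```sql
-- Assumed table: events(dt string, uid string).
from (
    select dt, uid
    from events
    cluster by dt                  -- mapper output: partitioned and sorted by dt
) m
select transform(m.dt, m.uid)      -- reducer: merge sorted runs, count per dt
using 'python count_per_dt.py'     -- hypothetical reducer script
as dt, cnt;
```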
`cluster by dt` is equivalent to `distribute by dt sort by dt`.
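
In other words, the inner query in the sketch above could be written either way (same assumed `events` table); both produce identically partitioned and sorted mapper output:

```sql
select dt, uid from events cluster by dt;
select dt, uid from events distribute by dt sort by dt;   -- same shuffle + sort
```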
Without `cluster by`, two reducers may receive the same `dt`. This makes it impossible to compute the count correctly, because reducers do not know about each other and do not share data: the same `dt` would be counted partially on different reducers, and the final result would contain multiple records with the same `dt`.
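
To illustrate, here is a broken variant of the earlier sketch (same hypothetical table and script), replacing `cluster by` with a bare `sort by`. `sort by` orders rows within each reducer only; it does not route all rows for one `dt` to a single reducer:

```sql
from (
    select dt, uid
    from events
    sort by dt                     -- no DISTRIBUTE BY dt: the same dt can land
                                   -- on two reducers, each emitting its own
                                   -- partial count for that dt
) m
select transform(m.dt, m.uid)
using 'python count_per_dt.py'     -- same hypothetical reducer script
as dt, cnt;
```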