The sortPartition method of a DataSet sorts each partition locally on the specified fields. How can I efficiently sort a large DataSet globally in Flink?

Ahmad.S

1 Answer

This is currently not easily possible because Flink does not yet provide a built-in range partitioning strategy.

A work-around is to implement a custom Partitioner:

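import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

DataSet<Tuple2<Long, Long>> data = ...
data
  .partitionCustom(new Partitioner<Long>() {
    @Override
    public int partition(Long key, int numPartitions) {
      // your implementation: return an index in [0, numPartitions) that
      // never decreases as the key grows, so partition i only holds keys
      // that are <= the keys in partition i + 1
    }
  }, 0)  // partition on field 0 of the tuple
  .sortPartition(0, Order.ASCENDING)  // then sort each partition locally
  .writeAsText("/my/output");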

Note: To achieve balanced partitions with a custom partitioner, you need to know the value range and distribution of the key.
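As a rough illustration of what such an implementation could look like, here is a minimal sketch in plain Java (no Flink dependency) that assumes the keys are spread roughly uniformly over a known inclusive range. The class name and the bounds MIN_KEY and MAX_KEY are hypothetical and only there for the example; for skewed data this mapping would produce unbalanced partitions.

```java
final class UniformRangePartitioning {
    // Hypothetical key bounds (inclusive), assumed to be known in advance.
    static final long MIN_KEY = 0L;
    static final long MAX_KEY = 999L;

    /** Maps a key to a partition index in [0, numPartitions), preserving key order. */
    public static int partition(long key, int numPartitions) {
        // Clamp so that out-of-range keys still land in a valid partition.
        long clamped = Math.max(MIN_KEY, Math.min(MAX_KEY, key));
        long span = MAX_KEY - MIN_KEY + 1; // number of distinct key values
        // Integer scaling; fine for small ranges, but may overflow for very wide ones.
        return (int) ((clamped - MIN_KEY) * numPartitions / span);
    }

    public static void main(String[] args) {
        for (long key : new long[] {0L, 249L, 250L, 999L}) {
            System.out.println(key + " -> partition " + partition(key, 4));
        }
    }
}
```

The body of partition could be used inside the Partitioner<Long> shown above. Because the index never decreases as the key grows, sorting each partition afterwards yields a global order across partitions.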

Support for a range partitioner (with automatic sampling) in Apache Flink is currently a work in progress and should be available soon.

Edit (June 7th, 2016): Range partitioning was added to Apache Flink with version 1.0.0. You can globally sort a data set as follows:

DataSet<Tuple2<Long, Long>> data = ...
data
  .partitionByRange(0)
  .sortPartition(0, Order.ASCENDING)
  .writeAsText("/my/output");

Note that range partitioning samples the input data set to estimate the key distribution, so that the resulting partitions are roughly equal in size.
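To give an idea of how sampling can lead to balanced ranges, here is a simplified sketch in plain Java. It is not Flink's actual implementation: the class and method names are made up for illustration, and it simply takes split points at evenly spaced quantiles of a sorted sample, then assigns a key to a partition by binary search.

```java
import java.util.Arrays;

final class SampledRangeBoundaries {
    /** Picks numPartitions - 1 split points at evenly spaced quantiles of a sample. */
    public static long[] boundaries(long[] sample, int numPartitions) {
        long[] sorted = sample.clone();
        Arrays.sort(sorted);
        long[] splits = new long[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            // Largest sampled key that still belongs to partition i - 1.
            int idx = (int) ((long) i * sorted.length / numPartitions) - 1;
            splits[i - 1] = sorted[Math.max(idx, 0)];
        }
        return splits;
    }

    /** Returns the index of the first split point that is >= key, or splits.length if none. */
    public static int partitionFor(long key, long[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        // A skewed sample: most keys are small, a few are very large.
        long[] sample = {5, 1, 9, 3, 7, 2, 8, 4, 6, 10, 100, 1000};
        long[] splits = boundaries(sample, 3);
        System.out.println("split points: " + Arrays.toString(splits));
        for (long key : new long[] {4, 5, 50}) {
            System.out.println(key + " -> partition " + partitionFor(key, splits));
        }
    }
}
```

Unlike a fixed, evenly spaced range, the split points follow the sampled distribution, so the few very large keys in this example do not leave the lower partitions overloaded.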

Fabian Hueske
  • 1) If we have no insight into the data set, how can we partition it? 2) Assuming we find a way to do so, does this command output a globally sorted data set? – Ahmad.S Dec 03 '15 at 22:15
  • 1) That's a good point. If you implement a custom partitioner, you should know about the value range and distribution of the key to achieve balanced partitions. The range partitioner in the linked pull request automatically samples the data to obtain a distribution. 2) Yes, if you range partition and sort each partition on the same key, the output will be globally sorted. – Fabian Hueske Dec 04 '15 at 08:34