Let's say I have a massive collection of strings that I want to sort with Apache Beam. Is this possible? I have only found documentation on sorting on a single machine, but I am looking for a distributed sort algorithm.

tohava
  • Beam does not have such an operation. Why do you need a globally sorted dataset? Users often ask about this, but so far in every case I can remember, it turned out that what they want to do doesn't actually require a global sort. – jkff Feb 27 '18 at 21:53
  • I am considering using beam to replace another distributed pipeline system that DOES support global sort. This is a requirement from outside that I can't control. – tohava Feb 27 '18 at 23:05
  • Am I understanding correctly that you want to generate a set of files, where data inside each file is ordered, AND data between files is ordered (i.e. files with lexicographically smaller names have earlier data), AND the amount of data is such that even reading or writing this amount using one machine is impractical? (Beam can sort a lot of data on one machine - it doesn't have to fit into memory) – jkff Feb 27 '18 at 23:30
  • (one way to do this would be to take your entire dataset, write it to a distributed database with sorted keys like Cloud Bigtable, and then read it back in order and write to files) – jkff Feb 27 '18 at 23:31
  • Yes, you understand what I want. Thanks. – tohava Feb 27 '18 at 23:43
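
To make the output layout discussed in the comments concrete (each file sorted internally, and files ordered relative to one another), here is a minimal single-process sketch of a sample-based range partition followed by per-shard sorts. It is plain Python, not Beam code and not something Beam provides; the function names `choose_split_points` and `range_partition_sort`, the `sample_size` default, and the in-memory lists standing in for files are all assumptions made for illustration.

```python
import bisect
import random


def choose_split_points(sample, num_shards):
    """Pick up to num_shards - 1 boundary strings from a sample of the data."""
    ordered = sorted(sample)
    if not ordered:
        return []
    # Evenly spaced quantiles of the sample approximate balanced shard sizes.
    return [ordered[len(ordered) * i // num_shards] for i in range(1, num_shards)]


def range_partition_sort(strings, num_shards, sample_size=1000, seed=0):
    """Return num_shards lists; each is sorted, and shard i precedes shard i + 1."""
    rng = random.Random(seed)
    sample = rng.sample(strings, min(sample_size, len(strings)))
    splits = choose_split_points(sample, num_shards)

    # Route every string to the shard whose key range contains it.
    shards = [[] for _ in range(num_shards)]
    for s in strings:
        shards[bisect.bisect_right(splits, s)].append(s)

    # Each shard can now be sorted independently (one worker or file per shard).
    return [sorted(shard) for shard in shards]


words = ["pear", "apple", "fig", "kiwi", "banana", "cherry", "date", "grape"]
shards = range_partition_sort(words, num_shards=3)
```

Concatenating the shards in index order gives the globally sorted sequence, because every string in shard i is smaller than every string in shard i + 1. In a distributed setting the routing step is the only part that needs shared state (the split points); how that step and the per-shard sort would map onto Beam transforms is left open here.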

0 Answers