
I have a huge number of 128-bit unsigned integers that need to be sorted for analysis (around a trillion of them!).

The research I have done on 128-bit integers has led me down a bit of a blind alley: NumPy doesn't seem to fully support them, and Python's built-in sorting is memory intensive when the values are held in lists.

What I'd like to do is load, for example, a billion 128-bit unsigned integers into memory (16GB if stored as raw binary data) and sort them. The machine in question has 48GB of RAM, so it should be fine to use 32GB for the operation. If it has to be done in smaller chunks that's OK, but doing as large a chunk as possible would be better. Does Python have a sorting approach that can handle data like this without a huge memory overhead?

I can sort 128-bit integers using the `.sort()` method on lists, and it works, but it doesn't scale to the level I need. I do have a custom-written C++ version that works incredibly quickly, but I would like to replicate it in Python to speed up development (I didn't write the C++, and I'm not comfortable in that language).
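
For reference, this is roughly what the working list-based version looks like (a trimmed sketch; the real code reads 16-byte records from binary files, and `data.bin` is just a placeholder name):

# Sketch of the current list-based approach (placeholder file name).
with open('data.bin', 'rb') as f:
    raw = f.read()

# Decode each 16-byte little-endian record into a Python int.
values = [int.from_bytes(raw[i:i + 16], 'little')
          for i in range(0, len(raw), 16)]

values.sort()  # works, but every value is a boxed Python int, so memory use is huge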

Apologies if there's more information required to describe the problem, please ask anything.

LocalGeek
  • Can you write the numbers to a text file, one per line, and use `sort -n` from the command line? – John Gordon Oct 21 '18 at 23:38
  • John, thanks but I don't think that text files will help as the data is already 16TB when represented as binary files. That might work for some types of sort, especially if it's a one-off, but I'm guessing (but could be wrong) that it's also not going to scale well. – LocalGeek Oct 22 '18 at 00:08

2 Answers

0

NumPy doesn't support 128-bit integers, but if you use a structured dtype composed of high and low unsigned 64-bit chunks, those will sort in the same order as the 128-bit integers would:

arr.sort(order=['high', 'low'])

As for how you're going to get an array with that dtype, that depends on how you're loading your data in the first place. I imagine it might involve calling ndarray.view to reinterpret the bytes of another array. For example, if you have an array of dtype uint8 whose bytes should be interpreted as little-endian 128-bit unsigned integers, on a little-endian machine:

arr_structured = arr_uint8.view([('low', 'uint64'), ('high', 'uint64')])
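
Here's a small end-to-end sketch of the idea (the sample values are made up, it assumes a little-endian machine, and it uses np.frombuffer on packed bytes where real code would presumably use np.fromfile or np.memmap on the binary files):

import numpy as np

# Hypothetical sample values spanning the 128-bit range.
values = [2**127 - 1, 12345, 2**64, 2**64 - 1, 0, 2**100 + 7]

# Pack each value as 16 little-endian bytes, mimicking the on-disk format.
raw = b''.join(v.to_bytes(16, 'little') for v in values)

# Reinterpret the raw bytes: low 8 bytes first, then high 8 bytes.
dt = np.dtype([('low', '<u8'), ('high', '<u8')])
arr = np.frombuffer(raw, dtype=dt).copy()  # copy() because frombuffer gives a read-only view

# Lexicographic sort on (high, low) matches 128-bit numeric order.
arr.sort(order=['high', 'low'])

# Check against Python's arbitrary-precision sort.
recombined = [(int(r['high']) << 64) | int(r['low']) for r in arr]
assert recombined == sorted(values)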

So that might be reasonable for a billion ints, but you say you've got about a trillion of these. That's a lot more than an in-memory sort on a 48GB RAM computer can handle. You haven't asked for something to handle the whole trillion-element dataset at once, so I hope you already have a good solution in mind for merging sorted chunks, or for pre-partitioning the dataset.
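
If you do end up needing a merge step, the standard library's heapq.merge will lazily k-way merge already-sorted streams. A rough sketch (hypothetical file names; it streams Python ints rather than NumPy arrays, so it trades speed for simplicity):

import heapq

def read_uint128s(path):
    # Yield 128-bit unsigned ints from a sorted file of 16-byte little-endian records.
    with open(path, 'rb') as f:
        while True:
            rec = f.read(16)
            if len(rec) < 16:
                return
            yield int.from_bytes(rec, 'little')

def merge_chunks(chunk_paths, out_path):
    # heapq.merge keeps only one pending record per chunk in memory.
    with open(out_path, 'wb') as out:
        for value in heapq.merge(*(read_uint128s(p) for p in chunk_paths)):
            out.write(value.to_bytes(16, 'little'))

# merge_chunks(['chunk_000.bin', 'chunk_001.bin'], 'merged.bin')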

user2357112
  • Thanks, I will look into this and report back. I have a solution for the thousands of sorted chunks: they don't need to be merged, as they are presorted into 1048 files based on the first 10 bits. The 1048 source files (around a billion 128-bit integers each) were created with this in mind. The big task will be the analysis, but that's a different story, and it's why a fast development environment like Python (to test on a smaller scale) is perfect. – LocalGeek Oct 21 '18 at 23:57
  • My results, albeit I'm new to this, show that NumPy is slower than built-in list sorting for this use case. Populating a NumPy array with random data was twice as slow as doing the same with a list (I created a zeroed NumPy array and then looped through it to insert random uint64 values, whereas the list could be given random uint128 values directly after being created zeroed, which might account for the difference). The sorting itself was a little slower, around 30% or so. Thank you for your suggestion; it taught me a few things anyway. Splitting the source files further seems to be the easiest way forward. – LocalGeek Oct 22 '18 at 16:53
  • @LocalGeek: Looping through a NumPy array manually is slow. You're not supposed to loop through them at the Python level; you're supposed to write code in a way that pushes loops into C, such as passing a `size` argument to `numpy.random.randint` (see the sketch after this comment thread). – user2357112 Oct 22 '18 at 16:57
  • Thanks. I kind of know I wasn't doing that part correctly - I just wanted to simulate a slow population method as I don't know how loading an ndarray from disk will perform. The important bit was the sorting though, which turned out to be similar in speed to built-in lists - probably because lists are able to use the 128 bit int in native C whereas NumPy (which is otherwise excellent) doesn't do that yet. Although I might not have the optimal answer, I have found something that will work for me - your suggestion helped a great deal. – LocalGeek Oct 22 '18 at 17:13
  • @LocalGeek: There's no native 128-bit int. Python is using an arbitrary-precision integer implementation. I think the timing difference is probably due to suboptimal structured-dtype handling, perhaps related to not being able to optimize the comparison as much. – user2357112 Oct 22 '18 at 17:17
  • Be aware that the timing characteristics change as the size of the dataset increases. For example, on my machine, the NumPy structured array sort beats the Python sort on 6250000 elements, but loses on 62500. I think effects like memory locality start having a bigger impact. – user2357112 Oct 22 '18 at 17:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/182289/discussion-between-localgeek-and-user2357112). – LocalGeek Oct 22 '18 at 18:17
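
For reference, the vectorized generation mentioned in the comments above looks something like this (a sketch; filling each 64-bit half of the structured array with one numpy.random.randint call keeps the loop in C):

import numpy as np

n = 1_000_000
dt = np.dtype([('low', '<u8'), ('high', '<u8')])
arr = np.empty(n, dtype=dt)

# One call per field fills the whole array in C instead of a Python-level loop.
arr['low'] = np.random.randint(0, np.iinfo(np.uint64).max, size=n, dtype=np.uint64)
arr['high'] = np.random.randint(0, np.iinfo(np.uint64).max, size=n, dtype=np.uint64)

arr.sort(order=['high', 'low'])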
0

I was probably expecting too much from Python, but I'm not disappointed. A few minutes of coding produced something (using built-in lists) that can sort a hundred million uint128 items on an 8GB laptop in a couple of minutes.

Given the sheer number of items to be sorted (around a trillion), it's clear that putting them into smaller bins/files as they are created makes more sense than trying to sort huge numbers of them in memory. The potential issues caused by appending data to thousands of files in 1MB chunks (fragmentation on spinning disks) are less of a worry, because sorting each fragmented file produces a sequential file that will be read many times, while the fragmented file itself is only written once and read once.
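
A rough sketch of the binning idea (placeholder paths and bin count; the real code buffers roughly 1MB per bin before appending, as described above):

import os

NUM_BITS = 10            # how many leading bits pick the bin
NUM_BINS = 1 << NUM_BITS

def bin_index(value):
    # Use the most significant NUM_BITS of the 128-bit value as the bin number.
    return value >> (128 - NUM_BITS)

def write_to_bins(values, out_dir):
    # Group values per bin, then append each group as 16-byte little-endian records.
    buffers = [bytearray() for _ in range(NUM_BINS)]
    for v in values:
        buffers[bin_index(v)] += v.to_bytes(16, 'little')
    for i, buf in enumerate(buffers):
        if buf:
            with open(os.path.join(out_dir, 'bin_%04d.bin' % i), 'ab') as f:
                f.write(buf)

Each bin file can then be loaded and sorted on its own, and the bins are already in the right order relative to each other.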

The development-speed benefits of Python seem to outweigh the performance hit versus C/C++, especially since the sorting only happens once.

LocalGeek