12

Let's assume we have a very large file which contains billions of integers, and we want to find the k largest of these values.

The tricky part is that k itself is very large too, which means we cannot keep k elements in memory (for example, we have a file with 100 billion elements and we want to find the 10 billion largest elements).

How can we do this in O(n)?

What I thought:

We start reading the file and compare each element against a second file that keeps the k largest elements seen so far (sorted in increasing order). If the element we read is larger than the first line of the second file, we delete that first line and insert the new element into the second file. The time complexity would be O(N log k) if we have random access to that file, otherwise O(N·k).

Any idea how to do this in O(n)? I guess if we had an external-memory version of the selection algorithm (the partition-based algorithm used in quicksort) we would be able to do this in O(n), but I couldn't find one anywhere.

banarun
Arian

6 Answers

11

You can do this pretty easily with a standard merge type algorithm.

Say you have 100 billion numbers and you want the top 10 billion. We'll say you can hold 1 billion numbers in memory at any time.

So you make a pass:

while not end of input
    read 1 billion numbers
    sort them in descending order
    save position of output file
    write sorted numbers to output file

You then have a file that contains 100 blocks of 1 billion numbers each. Each block is sorted in descending order.

Now create a max heap. Add the first number of each block to the heap. You'll also have to add the block number or the number's position in the file so that you can read the next number.

Then:

while num_selected < 10 billion
    selected = heap.remove()
    ++num_selected
    write selected to output
    read next number from the selected block and place on heap

There's a small bit of complexity involved, keeping track of which block the number came from, but it's not too bad.

The max heap never contains more than 100 items (basically, one item per block), so memory isn't an issue in the second pass. With a bit of work, you can avoid a lot of reads by creating a smallish buffer for each block so that you don't incur the cost of a disk read for every number that's selected.

It's basically just a disk merge sort, but with an early out.

Complexity of the first pass is b * (m log m), where b is the number of blocks and m is the number of items in a block. N, the total number of items in the file, is equal to b * m. Complexity of the second pass is k log b, where k is the number of items to select and b is the number of blocks.
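
For concreteness, here is a minimal Python sketch of the two passes. It assumes the input is a flat binary file of signed 64-bit integers and that `chunk_size` values fit in memory; the file names, record format and helper names are illustrative, not part of the answer. `heapq.merge` stands in for the hand-rolled heap-of-blocks described above.

    import heapq
    import itertools
    import struct

    ITEM = struct.Struct("<q")   # one signed 64-bit integer per record

    def read_ints(f):
        """Yield integers from an open binary file of 64-bit values."""
        while True:
            buf = f.read(ITEM.size)
            if not buf:
                return
            yield ITEM.unpack(buf)[0]

    def write_ints(f, values):
        for v in values:
            f.write(ITEM.pack(v))

    def first_pass(in_path, run_path, chunk_size):
        """Pass 1: sort each chunk in descending order and remember where each run starts."""
        offsets = []
        with open(in_path, "rb") as src, open(run_path, "wb") as dst:
            ints = read_ints(src)
            while True:
                chunk = list(itertools.islice(ints, chunk_size))
                if not chunk:
                    break
                chunk.sort(reverse=True)
                offsets.append(dst.tell())
                write_ints(dst, chunk)
        return offsets

    def run_reader(run_path, start, end):
        """Yield one sorted run: the values stored between byte offsets start and end."""
        with open(run_path, "rb") as f:
            f.seek(start)
            while end is None or f.tell() < end:
                buf = f.read(ITEM.size)
                if not buf:
                    return
                yield ITEM.unpack(buf)[0]

    def second_pass(run_path, out_path, offsets, k):
        """Pass 2: k-way merge of the descending runs, stopping after k numbers (the early out)."""
        bounds = zip(offsets, offsets[1:] + [None])
        runs = [run_reader(run_path, start, end) for start, end in bounds]
        merged = heapq.merge(*runs, reverse=True)   # keeps one candidate per run in a heap
        with open(out_path, "wb") as out:
            write_ints(out, itertools.islice(merged, k))

    # Illustrative usage with sizes far smaller than the ones in the answer:
    # offsets = first_pass("input.bin", "runs.bin", chunk_size=1_000_000)
    # second_pass("runs.bin", "top.bin", offsets, k=250_000)

In practice you would also give each run its own read buffer, as suggested above, rather than pulling one record at a time from disk.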

Jim Mischel
  • The bottleneck here is sorting 1 billion numbers, but my own approach has the same bottleneck too, so it's a good approach, just more complicated – Arian Jul 05 '13 at 14:30
  • @ArianHosseinzadeh: While it's true that your algorithm is O(N log k), your constant factors are huge if you're trying to maintain a heap on disk. Worst case, you'll do N sequential reads, k sequential writes, N*(log k) random reads and N*(log k) random writes. The average case is somewhat better: when k is 0.1*N, the number of items added to the heap will be about 0.06*N. With this algorithm, you do N sequential reads, k random reads, and (N + k) sequential writes. The decrease in disk I/O might very well make up for the difference in sorting time. – Jim Mischel Jul 05 '13 at 15:21
  • For a little info on the number of items actually added to the heap, see the discussion in http://blog.mischel.com/2011/10/25/when-theory-meets-practice/ – Jim Mischel Jul 05 '13 at 15:22
4

PS: My definition of K is different. It is a smallish number, say 2 or 100 or 1000. Here m corresponds to the OP's definition of k. Sorry about this.

Depends on how many reads you can do of the original data and how much more space you have. This approach assumes you have extra space equivalent to the original data.

Step 1: Pick K random numbers across the whole data
Step 2: Sort the K numbers (assume index are from 1 to K)
Step 3: Create K+1 separate files and name them 0 to K
Step 4: For every element in the data, if it is between the ith and (i+1)th of the sorted numbers, put it in the ith file (anything below the 1st goes to file 0 and anything above the Kth goes to file K).
Step 5: Based on the size of each file, choose the file that is going to contain the mth number.
Step 6: Repeat everything with the new file and new m (new_m = m - sum_of_size_of_all_lower_files)

Regarding the last step: if K=2, m=1000, and the sizes of files 0, 1 and 2 are 800, 900 and 200, then new_m = m - 800 = 200 and you work through file 1 iteratively.
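
A rough Python sketch of one round of this scheme, assuming a text file with one integer per line; the function and file names here are purely illustrative, and the recursion into the chosen bucket is left out.

    import bisect
    import os
    import random

    def sample_pivots(path, k):
        """Steps 1-2: reservoir-sample K values from the file, then sort them."""
        pivots = []
        with open(path) as f:
            for i, line in enumerate(f):
                v = int(line)
                if len(pivots) < k:
                    pivots.append(v)
                else:
                    j = random.randrange(i + 1)
                    if j < k:
                        pivots[j] = v
        pivots.sort()
        return pivots

    def partition_pass(path, pivots, workdir):
        """Steps 3-4: route each value to one of the K+1 bucket files, counting sizes."""
        buckets = [open(os.path.join(workdir, "bucket%d.txt" % i), "w")
                   for i in range(len(pivots) + 1)]
        sizes = [0] * len(buckets)
        try:
            with open(path) as f:
                for line in f:
                    i = bisect.bisect_right(pivots, int(line))  # values equal to a pivot go to the higher bucket
                    buckets[i].write(line)
                    sizes[i] += 1
        finally:
            for b in buckets:
                b.close()
        return sizes

    def locate_mth(sizes, m):
        """Steps 5-6: find which bucket holds the m-th smallest value and the new rank inside it."""
        for i, size in enumerate(sizes):
            if m <= size:
                return i, m      # repeat the whole procedure on this bucket with the new m
            m -= size
        raise ValueError("m exceeds the number of values")

With the pivots 4, 10, 15 from the comments below, a value of 7 lands in bucket 1 and a value of 15 lands in bucket 3, matching the ranges ElKamina gives.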

ElKamina
  • Nice approach, but I guess the constant factor of the time complexity could be very large (as we are doing lots of reads), and what are `m` and `k` here? – Arian Jul 01 '13 at 20:06
  • @ArianHosseinzadeh Sorry about this. I have updated my definition of K and m. Yes, the number of reads is large (log n of them), but in every iteration you read smaller files. E.g. if you set K to 1000, in the second step you read only roughly 1/1000th of the original size. – ElKamina Jul 01 '13 at 20:50
  • This is a natural extension of [this](http://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_general_selection_algorithm). I'm thinking the optimal would probably be to determine the maximum K we can fit into memory. – Bernhard Barker Jul 01 '13 at 20:52
  • @ElKamina: What is the use of the (K+1)th file? I think all the elements will go into files 0 to K-1. Another doubt: will files 0..K-1 each have one of the selected K numbers as their first element? Please clarify. – Aseem Goyal Feb 22 '14 at 13:21
  • @aseem All numbers greater than the Kth value will go to the (K+1)th file. – ElKamina Feb 22 '14 at 17:55
  • @ElKamina: Say the random numbers are 4, 10, 15; then 0-3 will be in file0, 4-14 in file1 and 15-INFINITY in file2. Am I understanding it wrongly? – Aseem Goyal Feb 22 '14 at 17:59
  • 0-3 in file0, 4-9 in file1, 10-14 in file2, 15-infinity in file3 – ElKamina Feb 22 '14 at 19:05
4

You can do this by maintaining a min-heap of max size k.

  • Every time a new number arrives, check whether the heap is smaller than k; if it is, add the number.

  • If it is not, check whether the minimum is smaller than the new element, and if it is, pop it out and insert the new element instead.

When you are done, you have a heap containing the k largest elements. This solution has O(n log k) complexity, where n is the number of elements and k is the number of elements you need (a short sketch of this follows below).

  • It can also be done in O(n) using a selection algorithm: store all the elements, find the (k+1)th largest element, and return everything bigger than it. But it is harder to implement, and for reasonably sized input it might not be better. Also, if the stream contains duplicates, more processing is needed.
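
A minimal Python version of the heap approach (it assumes, like this answer, that k items fit in memory; `heapq` is a min-heap, so the smallest of the current top k sits at the root):

    import heapq

    def top_k(numbers, k):
        """Keep the k largest values seen so far in a min-heap of size k."""
        heap = []
        for x in numbers:
            if len(heap) < k:
                heapq.heappush(heap, x)      # heap not full yet: just add
            elif x > heap[0]:
                heapq.heapreplace(heap, x)   # evict the current minimum in O(log k)
        return sorted(heap, reverse=True)    # the k largest, largest first

    # top_k([5, 1, 9, 3, 7, 8], 3) == [9, 8, 7]
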
pola sai ram
2

If all values are distinct, or we can ignore duplicates, and we have 32-bit integers, I would simply use one bit per possible value (needs 2^32 bits = 2^29 bytes = 512 megabytes, which should fit in your RAM).

  1. Initialize the 512 MB bitmap to 0.
  2. While reading the file linearly (O(n)), set the corresponding bit for each value read.
  3. At the end, scan down from the largest possible value and collect the first k set bits to get the k largest values (O(2^32) bit tests).

If the values are not distinct and you want to know how often the values occur, you can add a 4th step where you read the file again and count the number of occurrences of the values found in the first 3 steps. That is still O(n).
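
A rough Python sketch of the bitmap idea, assuming unsigned 32-bit values; the downward scan in step 3 is written out literally, so treat it as an illustration rather than something tuned for speed.

    def k_largest_distinct(values, k):
        """Mark every seen 32-bit value in a bitmap, then collect the top k set bits."""
        bits = bytearray(1 << 29)                 # 2^32 bits = 512 MB, all zero
        for v in values:                          # step 2: one linear pass, O(n)
            bits[v >> 3] |= 1 << (v & 7)
        largest = []
        for v in range((1 << 32) - 1, -1, -1):    # step 3: scan down from the largest possible value
            if bits[v >> 3] & (1 << (v & 7)):
                largest.append(v)
                if len(largest) == k:
                    break
        return largest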

MrSmith42
  • Only up to a value of 4 billion fits into a 32-bit integer, so for 100 billion values there'd be a ton of duplicates. Ignoring duplicates when finding the kth biggest value would be a very strange requirement. – Bernhard Barker Jul 01 '13 at 20:33
  • But it is a valid requirement. I asked the author of the question (see the comment on the question). (You may want to know what the k different biggest values are.) – MrSmith42 Jul 02 '13 at 09:33
0
  • We can use a PriorityQueue with size k.
  1. Keep adding the values to the PriorityQueue.
  2. If the size becomes more than k, remove the head element (a PriorityQueue orders elements in ascending order by default, so the head is the smallest).
  3. After all the elements have been added, popping the head of the PriorityQueue gives the k-th largest element.
-1

Use randomised selection to find the kth largest element in the file. You can do this in linear time with a few passes over the input, as long as the file isn't too ridiculously many times larger than memory. Then just dump out everything that's at least as large as that element.
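
A sketch of this in Python over a text file with one integer per line; for brevity it uses a single random pivot per pass rather than the sampled median described in the comments below, and every name here is illustrative.

    import random

    def random_value(path):
        """Reservoir-sample one line from the file to use as a pivot."""
        chosen = None
        with open(path) as f:
            for i, line in enumerate(f):
                if random.randrange(i + 1) == 0:
                    chosen = line
        return int(chosen)

    def kth_largest(path, k):
        """Randomized selection: partition the file around a pivot until the rank is pinned down."""
        while True:
            pivot = random_value(path)
            hi_path, lo_path = path + ".hi", path + ".lo"
            greater = equal = 0
            with open(path) as src, open(hi_path, "w") as hi, open(lo_path, "w") as lo:
                for line in src:
                    v = int(line)
                    if v > pivot:
                        hi.write(line)
                        greater += 1
                    elif v == pivot:
                        equal += 1
                    else:
                        lo.write(line)
            if k <= greater:
                path = hi_path                  # the k-th largest is among the larger values
            elif k <= greater + equal:
                return pivot                    # the pivot itself is the k-th largest
            else:
                k -= greater + equal            # keep hunting among the smaller values
                path = lo_path

A final pass over the original file then writes out everything at least as large as the value returned, as the answer says.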

tmyklebu
  • What do you mean by `randomised` selection? Do you mean `randomly` accessing the file (e.g. using `RandomAccessFile` in Java)? – Arian Jul 01 '13 at 18:05
  • Take a small random sample of the array. Find the median of the sample. Filter out, or store away, everything in the array that's on the wrong side of the median. Repeat until the input array is small enough. – tmyklebu Jul 01 '13 at 18:10
  • As in the question, if you want to find the `1 billion-th` largest element, what I understand from your approach is that we first find (e.g.) the 100-th largest element in smaller chunks of the file and then combine them. That's possible, but you may lose some data that you want: assume the main file is sorted in increasing order; then you are going to eliminate some data that could be in the wanted range. – Arian Jul 01 '13 at 18:15