
Trying to find the best and simplest way to list the top 5 numbers from a 150 GB text file.

The file I am searching contains only numbers, one per line, as below.

456789876
098765
36
48987
4509876
.
.
.

I tried the program below, but it displays only single digits instead of the complete numbers.

from heapq import nlargest

data=open('number.txt','r')
text=data.read()
print (text)
print nlargest(5, (text))
data.close()

Any other way to pick the top 5?

yog raj
  • You are not treating your data as numbers. You have *strings*, and those are sorted lexicographically, so the 'largest' number can be `'9'`, as `'9'` sorts after `'888888'`. – Martijn Pieters Sep 03 '18 at 10:30
  • You are also feeding a *single string* to `nlargest()`, so it'll process *individual characters*, giving you a series of `'9'` characters. – Martijn Pieters Sep 03 '18 at 10:31
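
A quick illustration of what those comments describe, using a small made-up string in place of the real file:

>>> from heapq import nlargest
>>> nlargest(5, "42\n7\n100\n")  # a single string, so nlargest() iterates over its characters
['7', '4', '2', '1', '0']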

2 Answers

4

You are not treating your data as numbers. You are instead passing the whole file contents (one very large string) to nlargest(), which can then only give you the largest individual characters in lexicographical order; the character '9' sorts after the character '8' here.

You need to a) read your input line by line, not as one big string, and b) convert your data to integers so they are compared by numeric value:

from heapq import nlargest

def as_numbers(it):
    for line in it:
        try:
            yield int(line)
        except ValueError:
            # not a line with a number, skip
            continue

with open('number.txt') as data:
    five_largest = nlargest(5, as_numbers(data))                
    print(five_largest)

I've used a generator function to convert lines to integers here, because that makes it easy to keep using heapq.nlargest(), which is absolutely the right tool for this job: it keeps the top K values available in O(N log K) time, so for a fixed K=5 that's essentially linear in the number of values in the file. The generator function takes care of the int() conversion and skips any lines that can't be converted.
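
For reference, here is a rough sketch of the kind of bounded-heap bookkeeping nlargest() does; this is not the actual CPython implementation, just the idea of keeping a size-K min-heap:

import heapq

def top_n(iterable, n=5):
    """Keep only the n largest values seen so far, using a size-n min-heap."""
    heap = []
    for value in iterable:
        if len(heap) < n:
            heapq.heappush(heap, value)      # heap not full yet, just add the value
        elif value > heap[0]:
            heapq.heappushpop(heap, value)   # replace the current smallest of the top n
    return sorted(heap, reverse=True)        # final ordering of n items is cheap

print(top_n([3, 656, 6234, 47345734, 36, 48987], n=3))  # [47345734, 48987, 6234]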

Note also the use of with on the opened file object; at the end of the with block the file is closed for you automatically, so there is no need to call data.close() explicitly here. This happens even if an exception is raised!
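
The with statement behaves roughly like wrapping the file handling in try/finally; a minimal sketch of the equivalent:

data = open('number.txt')
try:
    five_largest = nlargest(5, as_numbers(data))
    print(five_largest)
finally:
    data.close()  # runs even if an exception was raised above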

Demo:

>>> from heapq import nlargest
>>> import random
>>> from io import StringIO
>>> random_data = '\n'.join([str(random.randrange(10**6)) for _ in range(10000)])  # 10k lines of random numbers between 0 and 1 million
>>> random_data.splitlines(True)[1042:1045] # a few sample lines
['39909\n', '15068\n', '420171\n']
>>> sample_file = StringIO(random_data)  # make it a file object
>>> def as_numbers(it):
...     for line in it:
...         try:
...             yield int(line)
...         except ValueError:
...             # not a line with a number, skip
...             continue
...
>>> nlargest(5, as_numbers(sample_file))
[999873, 999713, 999638, 999595, 999566]
Martijn Pieters
  • I'm not familiar with `heapq` at all, but if it takes any iterable then I agree this is the right method (it could have taken only a list, which would require you to read the whole file into memory!) – Joe Iddon Sep 03 '18 at 10:37
  • @JoeIddon: `heapq.nlargest()` takes an iterable, and a list of 5 elements (the heap) is kept up to date using [heap operations](https://en.wikipedia.org/wiki/Heap_(data_structure)). A heap doesn't require a full sort (heap order is not quite sorted order, but can be sorted efficiently at the end). – Martijn Pieters Sep 03 '18 at 10:40
  • @JoeIddon: whenever you need to pick a TOP-N or LEAST-N of a series, use `heapq`; `Counter.most_common()` does so whenever you ask for a subset of the counter, for example. – Martijn Pieters Sep 03 '18 at 10:41
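
As a small illustration of that last comment, Counter.most_common() with an argument takes the same top-N shortcut:

>>> from collections import Counter
>>> Counter('aabbbcccc').most_common(2)  # only the 2 most frequent entries, no full sort needed
[('c', 4), ('b', 3)]
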
-1

Input:

456789876
098765
36
48987
4509876
563456
47345734
6234
67456
235423
7348
3
656    

Code:

data = open('number.txt', 'r')
text = data.readlines()           # read the file line by line into a list of strings
data.close()
numbers = sorted(map(int, text))  # convert the strings to ints and sort them
print(numbers[-5:])               # print the 5 largest
print(numbers[:5])                # print the 5 smallest

Result:

[235423, 563456, 4509876, 47345734, 456789876]
[3, 36, 656, 6234, 7348]
Carlo 1585
  • This is very inefficient for two reasons: a) you read the whole file into memory, so you now require gigabytes of working memory, and b) sorting the whole list just to find the top 5 or bottom 5 is a huge waste of processing time. You never need to know the exact order of the elements that are not at either end, just that they are not part of those subsets. The OP was already using the much more efficient `heapq` module for this task; sorting takes O(N log N) time while the heapq takes O(N log K) time. For 1 million elements, that's a factor of about 8 times longer! – Martijn Pieters Sep 03 '18 at 10:44
  • Add to that that `heapq.nlargest()` only requires the running top-5 elements to be kept in memory at any one time. Sorting instead will be even slower, because the OS has to swap memory in and out for your version, versus only handling file buffering and a handful of integers in the heapq case; the total memory used is then mostly Python itself rather than 150GB worth of numbers in a list. – Martijn Pieters Sep 03 '18 at 10:46
  • tks @MartijnPieters, I never had to use `heapq.nlargest()` before so I had no idea of this ;) but it's good to know ;) :p – Carlo 1585 Sep 03 '18 at 10:48
  • It's always good to learn about some basic CS algorithms and understand their time and space complexity (use a [cheat sheet, no need to memorise specifics](http://bigocheatsheet.com/)). Sort vs heap is one of those fundamental building blocks. – Martijn Pieters Sep 03 '18 at 10:59
  • @MartijnPieters very nice ;) tks so much – Carlo 1585 Sep 03 '18 at 11:01
  • (And I have yet to come across a machine that'll let you read 150GB into a Python list of strings in memory, use `map()` to create a list of integers *in addition to the strings*, then confidently let you sort that list. If you don't outright run into a memory error (you almost certainly will), you'll be waiting a *long time* for all the memory swapping to complete.) – Martijn Pieters Sep 03 '18 at 11:03
  • (On my 64-bit system, a Python list uses 8 bytes per element. The strings are reasonably efficient, 1 byte per character, but 150GB of text means 150GB of memory needed to store all that, plus an 8-byte list element for every line; if we take the question sample as representative, the lines average 7 characters each, so you'd need *roughly* 322GB for the first list, then another 24 bytes per integer, so another 686GB for the integer list. That's nearly 1TB of memory, and we'd still need to sort the 686GB section!) – Martijn Pieters Sep 03 '18 at 11:11
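
Rough back-of-the-envelope arithmetic behind those last estimates (assuming, as in that comment, roughly 7 characters per line, 8-byte list pointers, and about 24 extra bytes per small int object):

file_bytes = 150 * 10**9            # 150 GB of raw text
lines = file_bytes // 7             # roughly 21 billion lines of ~7 characters each
str_list = file_bytes + 8 * lines   # string data plus one 8-byte list pointer per line
int_list = (24 + 8) * lines         # int objects plus one list pointer each
print(str_list / 10**9, int_list / 10**9)   # roughly 321 GB and 686 GB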