I have a text file that has around 100000000 lines, each of the following type:

string num1 num2 num3 ... num500
string num1 num2 num3 ... num40

I want to find the largest number present in this file.

My current code reads each line, splits it by space, and stores the largest number in the current line. Then, I compare it with the largest number of the next line, and retain the larger of the two.

with open(filename, 'r') as f:
    prev_max = -1
    for line in f:
        nums = [int(n) for n in line.split()[1:]]  # skip the leading string
        line_max = max(nums)
        if line_max > prev_max:
            prev_max = line_max

But this takes forever. Is there a better way to do this?

I am open to solutions with awk or other shell commands as well.

Edit: Added how I am reading the file.

  • How are you getting `all_lines`? – alkasm Jan 01 '19 at 05:21
  • What do you mean by reading normally? Please post a minimal example showing what you actually do with the file. – Mad Physicist Jan 01 '19 at 06:08
  • You didn't answer my question. What is `all_lines` specifically? Please post *all* of your code. – alkasm Jan 01 '19 at 06:13
  • https://www.blog.pythonlibrary.org/2014/01/27/python-201-an-intro-to-generators/ – xudesheng Jan 01 '19 at 06:13
  • What's your I/O subsystem like? If you can have different threads or processes reading a different subset of the file concurrently, that's liable to help a lot; no point to leaving *either* CPU or I/O bandwidth wasted. – Charles Duffy Jan 01 '19 at 15:33
  • Basically -- take the size of your file in bytes, halve it, seek to that point, find the location of the next newline, make that your split point, so one thread finds the max of everything before it and one thread finds the max of everything after. Repeat until you've got the workload split into an adequate number of subdivisions (see the sketch after these comments). – Charles Duffy Jan 01 '19 at 15:36
  • As you have a large number of numbers on each line, there is a fair amount of work to be done per line, so it may be worth using some parallelism/threading as Charles Duffy suggests, because it may not be I/O bound. I would suggest you look at **GNU Parallel** specifically with the `--pipepart` option to chunk the file into as many pieces as you have CPU cores and process them in parallel. If you provide some code that generates representative data with the appropriate number of lines and samples per line, I may (or may not) experiment for you. – Mark Setchell Apr 02 '19 at 09:18
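
Below is a minimal Python sketch of the byte-offset splitting Charles Duffy describes above (and of the chunk-per-core idea behind Mark Setchell's GNU Parallel suggestion). The file name `file`, the chunk count, and the helper `chunk_max` are illustrative assumptions, not part of the question: each worker seeks to its starting byte offset, skips the partial line there (the previous chunk owns it), and scans whole lines up to its end offset.

# Hypothetical sketch: split the file into byte ranges, one worker per core,
# align each range to line boundaries, and take the max of the per-chunk maxima.
import os
from multiprocessing import Pool

FILENAME = "file"                 # assumed input file name
NUM_CHUNKS = os.cpu_count() or 4  # one chunk per CPU core

def chunk_max(bounds):
    """Largest number on the lines that begin inside this byte range."""
    start, end = bounds
    best = None
    with open(FILENAME, "rb") as f:
        f.seek(start)
        if start != 0:
            f.readline()          # partial line: the previous chunk handles it
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            for tok in line.split()[1:]:   # first field is the string label
                val = int(tok)
                if best is None or val > best:
                    best = val
    return best

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    bounds = [(size * i // NUM_CHUNKS, size * (i + 1) // NUM_CHUNKS)
              for i in range(NUM_CHUNKS)]
    with Pool(NUM_CHUNKS) as pool:
        maxima = [m for m in pool.map(chunk_max, bounds) if m is not None]
    print(max(maxima))

As the comments note, whether this beats a single sequential pass depends on whether the scan is CPU-bound or I/O-bound on the machine in question.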

4 Answers

It's a trivial task for awk.

awk 'NR==1{m=$2} {for(i=2;i<=NF;++i) if(m<$i) m=$i} END{print m}' file

If it's guaranteed that your file is not all zeroes or negative numbers, you can drop the `NR==1{m=$2}` part.

oguz ismail

Try this Perl solution:

$ cat sample1.txt
string 1 2 4 10 7
string 1 2 44 10 7
string 3 2 4 10 70
string 9 2 44 10 7
$ perl -lane ' $m=(sort {$b<=>$a} @F[1..$#F])[0]; $max=$m>$max?$m:$max ; END { print $max } ' sample1.txt
70
$
stack0114106

  • Can use `max` from core [List::Util](https://perldoc.perl.org/List/Util.html) instead of `sort`, for efficiency: `perl -MList::Util=max -lane'$m = max @F; ....` – zdim Jan 01 '19 at 11:10
  • @zdim..you are right..:-) my office RHEL Perl is throwing error for installing CPAN modules.. so I'll have to live with core modules :-( – stack0114106 Jan 01 '19 at 11:14
  • Oh, sorry. Can you upgrade? The v5.10.1 is fine but really old at this point. Or, run with perlbrew? – zdim Jan 01 '19 at 11:17
  • yeah..it is old.. If I'm admin, I can do that.. that will take a long time.. btw if you have time can you try questions/53706983 using Perl.. – stack0114106 Jan 01 '19 at 11:22

I wanted to write an awk script that doesn't for-loop over the columns, to compare execution times against a for-looped solution such as @oguzismail's. I created a million records of 1-100 columns of data, with values between 0 and 2^32. I played around with RS to compare only columns 2-100, but since that required a regex it slowed down the execution. A lot. Using tr to swap spaces and newlines, I got pretty close:

$ cat <(echo 0) file | tr ' \n' '\n ' | awk 'max<$1{max=$1}END{print max}'

Output of cat <(echo 0) file | tr ' \n' '\n ':

0 string1
1250117816
3632742839
172403688 string2
2746184479
...

The trivial for-loop solution used:

real    0m24.239s
user    0m23.992s
sys     0m0.236s

whereas my tr + awk spent:

real    0m28.798s
user    0m29.908s
sys     0m2.256s

(Surprisingly, if I first preprocessed the data with tr to a file and then read that with awk, it wasn't any faster; most of the time it was actually slower.)

So, then I decided to test my rusty C skills to set some kind of baseline (the man pages are pretty good. And Google.):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  FILE * fp;
  char * line = NULL;
  char * word = NULL;
  size_t len = 0;
  ssize_t read;
  long max = 0;
  long tmp = 0;

  fp = fopen("file", "r");
  if (fp == NULL)
    exit(EXIT_FAILURE);
  while ((read = getline(&line, &len, fp)) != -1) {
    /* first token is the leading string; the inner loop walks the numbers */
    if ((word = strtok(line, " ")) != NULL) {
      while (word != NULL) {
        if ((word = strtok(NULL, " ")) != NULL) {
          tmp = strtol(word, NULL, 10);
          if (max < tmp) {
            max = tmp;
          }
        }
      }
    }
  }
  free(line);
  fclose(fp);
  printf("%ld\n", max);
  exit(EXIT_SUCCESS);
}

Result of that:

$ time ./a.out 
4294967292

real    0m9.307s
user    0m9.144s
sys     0m0.164s

Oh, and using mawk instead of gawk almost halved the run times.

James Brown

You don't need C or C++ for speed; awk has plenty.

I created a 957 MB synthetic file of random integers between 0 and 2^48 - 1, then scrubbed the tail of all even digits (to reduce, but not eliminate, the clumping of the decimal digit-count distribution towards the high side caused by rand() itself being uniformly distributed).

That also means the true minimum is 1, not 0.

        # rows |  # of decimal digits

             5 |   1
            45 |   2
           450 |   3
         4,318 |   4
        22,997 |   5
        75,739 |   6
       182,844 |   7
       382,657 |   8
       772,954 |   9
     1,545,238 |  10
     3,093,134 |  11
     6,170,543 |  12
    12,111,819 |  13
    22,079,973 |  14
    22,204,710 |  15

… and it took awk just 6.28 secs to scan 68.6 mn rows (70 mn pre-dedupe) and locate the largest one:

The largest value found: 281474938699775 (hex 0xFFFF FDBB FFFF)

f='temptest_0_2_32.txt'

mawk2 'BEGIN { srand();srand()
 
   __=(_+=++_)^(_^(_+_+_)-_^_^_)
   _*=(_+(_*=_+_))^--_

   while(_--) { print int(rand()*__) } }' | 

mawk 'sub("[02468]+$",_)^_' | uniqPB | pvE9 > "${f}"

pvE0 < "${f}" | wc5; sleep 0.2; 

( time ( pvE0 < "${f}" | 

  mawk2 '  BEGIN  { __ = _= (_<_)
         } __<+$_ { __ = +$_ 
         } END    { print __ }' 

) | pvE9 )

    out9:  957MiB 0:01:01 [15.5MiB/s] [15.5MiB/s] [ <=> ]
    in0:  957MiB 0:00:04 [ 238MiB/s] [ 238MiB/s] [=======>] 100% 
      
  rows = 68647426. | UTF8 chars = 1003700601. | bytes = 1003700601.
     in0: 15.5MiB 0:00:00 [ 154MiB/s] [ 154MiB/s] [> ]  1% ETA 0:00:00
    out9: 16.0 B 0:00:06 [2.55 B/s] [2.55 B/s] [ <=> ]
     in0:  957MiB 0:00:06 [ 152MiB/s] [ 152MiB/s] [====>] 100%   
     
( pvE 0.1 in0 < "${f}" | mawk2 ; )  

6.17s user 0.43s system 105% cpu 6.280 total

     1  281474938699775

At these throughput rates, using something like gnu-parallel may only yield small gains compared to a single awk instance.

RARE Kpop Manifesto