You don't need C or C++ for speed; awk has plenty.
To demonstrate, I created a 957 MB synthetic file of random integers between 0 and 2^48 - 1, then scrubbed each value's tail of all even digits (to reduce, but not eliminate, the clumping of the decimal-digit-count distribution towards the high side that comes from rand() itself being uniformly distributed). That scrubbing also means the true minimum is 1, not 0.
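To illustrate the scrub (these input values are just made-up examples), any run of trailing even digits gets deleted, so 0 vanishes entirely and 10 collapses to 1:

# strip trailing even digits: 123480 -> 123, 10 -> 1, 0 -> (empty line)
printf '%s\n' 123480 10 0 | mawk '{ sub("[02468]+$", ""); print }'

The resulting digit-length distribution: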
     # rows | # of decimal digits
          5 | 1
         45 | 2
        450 | 3
      4,318 | 4
     22,997 | 5
     75,739 | 6
    182,844 | 7
    382,657 | 8
    772,954 | 9
  1,545,238 | 10
  3,093,134 | 11
  6,170,543 | 12
 12,111,819 | 13
 22,079,973 | 14
 22,204,710 | 15
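That histogram is itself a quick awk job; a minimal sketch, assuming the generated file "${f}" defined below and values of at most 15 digits:

# count rows per decimal-digit length, then print counts for lengths 1..15
mawk '{ counts[length($0)]++ }
  END { for (d = 1; d <= 15; d++) print counts[d]+0, d }' "${f}"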
… and it took awk just 6.28 secs to scan the 68.6 million rows (70 million pre-dedupe) and locate the largest one:

281474938699775 | 0x FFFF FDBB FFFF
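You can sanity-check that hex rendering with the shell's own printf:

printf '%x\n' 281474938699775     # fffffdbbffff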
The full run, lightly annotated:

f='temptest_0_2_32.txt'

mawk2 'BEGIN { srand(); srand()                       # seed from the clock
  __ = (_+=++_)^(_^(_+_+_) - _^_^_)                   # __ = 2^(64-16) = 2^48
   _ *= (_+(_*=_+_))^--_                              #  _ = 7 * 10^7 rows
  while (_--) { print int(rand()*__) } }' |
mawk 'sub("[02468]+$",_)^_' | uniqPB | pvE9 > "${f}"  # scrub tails, dedupe

pvE0 < "${f}" | wc5; sleep 0.2

( time ( pvE0 < "${f}" |
  mawk2 'BEGIN { __ = _ = (_<_)    # start the running max at zero
       } __ < +$_ { __ = +$_       # coerce the row to a number, keep the max
       } END { print __ }'
) | pvE9 )
out9: 957MiB 0:01:01 [15.5MiB/s] [15.5MiB/s] [ <=> ]
in0: 957MiB 0:00:04 [ 238MiB/s] [ 238MiB/s] [=======>] 100%
rows = 68647426. | UTF8 chars = 1003700601. | bytes = 1003700601.
in0: 15.5MiB 0:00:00 [ 154MiB/s] [ 154MiB/s] [> ] 1% ETA 0:00:00
out9: 16.0 B 0:00:06 [2.55 B/s] [2.55 B/s] [ <=> ]
in0: 957MiB 0:00:06 [ 152MiB/s] [ 152MiB/s] [====>] 100%
( pvE 0.1 in0 < "${f}" | mawk2 ; )
6.17s user 0.43s system 105% cpu 6.280 total
1 281474938699775
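For anyone allergic to the code golf, a plain spelling of the same scan (identical logic, nothing hidden):

# track the running maximum; unary + forces a numeric comparison
mawk 'BEGIN { max = 0 }            # safe floor: every value is >= 1 post-scrub
      +$0 > max { max = +$0 }
      END { print max }' "${f}"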
At these throughput rates, using something like gnu-parallel may only yield small gains over a single awk instance.
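If you wanted to test that, here's a sketch of the chunked variant (a max per block, then a max of maxes; the 64M block size is an arbitrary pick on my part):

# split "${f}" into blocks, scan each in a separate mawk, reduce at the end
parallel -q --pipepart -a "${f}" --block 64M \
  mawk '+$0 > max { max = +$0 } END { print max }' |
mawk '+$0 > max { max = +$0 } END { print max }'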