Excuse the useless cat and echo up front, but while running less on a ~2GB .gz file I'm seeing ~25GB of RAM consumed (even though the output is piped into awk and consumed there):

[user@mybox:~]$ cat <(echo '173abcde7665559.90651926
131abcde7298936.49040546
... (25 lines total here) ...
186abcde4858463.43044639
163abcde9409643.80726489'|awk '{print "KEY 1"length($1)-16":"$1}';
less /tmp/stats.gz)|awk '{if("KEY"==$1){K[$2]=1}else{if($8 in K)print}}' >bad25&
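To unpack the pipeline: the first awk tags each lookup value with a `KEY` prefix, and the second awk loads those tagged lines into a table, then prints only data lines whose 8th field is a known key. A stripped-down sketch of that filter with made-up sample records (the real ones come from stats.gz):

```shell
#!/bin/sh
# Stripped-down sketch of the filter: "KEY <k>" lines load a lookup table,
# other lines are printed only if their 8th field is a known key.
# The sample records below are made up; the real ones come from stats.gz.
printf '%s\n' \
  'KEY k1' \
  'a b c d e f g k1' \
  'a b c d e f g k2' |
awk '{if("KEY"==$1){K[$2]=1}else{if($8 in K)print}}'
# prints only: a b c d e f g k1
```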

I expected the above to complete with almost no RAM, but to my surprise here is how it looked ~2.5h later (by which time it was 89.8% through the .gz):

[user@mybox:~]$ ps auxf|grep -e 'pts/2' -e PID |grep -v grep
USER   PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user 26896  0.0  0.0  68356  1580 pts/2    Ss+  15:23   0:00  \_ /bin/bash
user 27927  0.7  0.0  58932   476 pts/2    S    15:23   1:00      \_ cat     /dev/fd/63
user 27929  0.0  0.0  68356   716 pts/2    S+   15:23   0:00      |   \_ /bin/bash
user 27932 99.9 75.0 22389852 18512388 pts/2 R+ 15:23 137:42      |       \_ less /tmp/stats.gz
user 27933  0.0  0.0  65928  1168 pts/2    S+   15:23   0:00      |           \_ /bin/sh - /usr/bin/lesspipe.sh /tmp/stats.gz
user 27934  1.3  0.0   4176   492 pts/2    S+   15:23   1:52      |               \_ gzip -dc -- /tmp/stats.gz
user 27928  2.1  0.0  63908   776 pts/2    S    15:23   2:56      \_ awk {if("KEY"==$1){K[$2]=1}else{if($8 in K)print}}
[user@mybox:~]$ free -m
             total       used       free     shared    buffers     cached
Mem:         24103      23985        117          0        125       3958
-/+ buffers/cache:      19901       4201
Swap:         8191       7914        277
[user@mybox:~]$ echo 1|awk "{print `cat /proc/27934/fdinfo/3|sed -n 's/pos:[ \t]*//p'`/`du -b /tmp/stats.gz|sed 's/[ \t].*//'`}";date
0.898284
Sat Apr  4 17:41:24 GMT 2015

I'll try some other options (like rewriting my command with a direct gzip -dc or zcat) to see whether those help, but can someone explain WHY this is happening with less (or any other command)? Is it a known less or bash bug fixed in a later version? Or are there shell tricks to force less to behave properly?
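For what it's worth, the gzip -dc rewrite mentioned above keeps the same pipeline shape while streaming through a small fixed buffer instead of paging. A self-contained sketch against a throwaway sample file (`/tmp/demo-stats.gz` is made up here; the real input is /tmp/stats.gz):

```shell
#!/bin/bash
# Same pipeline shape, but with `less` replaced by `gzip -dc`, which
# streams with a small fixed buffer instead of keeping output in memory.
# /tmp/demo-stats.gz is a throwaway sample; the real file is /tmp/stats.gz.
printf 'a b c d e f g 18:173abcde7665559.90651926\n' |
  gzip -c > /tmp/demo-stats.gz

cat <(echo '173abcde7665559.90651926' |
        awk '{print "KEY 1"length($1)-16":"$1}';
      gzip -dc -- /tmp/demo-stats.gz) |
  awk '{if("KEY"==$1){K[$2]=1}else{if($8 in K)print}}'
# prints the matching record: a b c d e f g 18:173abcde7665559.90651926
```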

P.S. stats.gz is 25261745373 bytes uncompressed (the size wrapped around 2^32 five times, which is why gzip -l reports it wrong):

[user@mybox:~]$ ls -l /tmp/stats.gz
-rw-r--r-- 1 user users 1837966346 Apr  3 21:42 /tmp/stats.gz
[user@mybox:~]$ gzip -l /tmp/stats.gz
         compressed        uncompressed  ratio uncompressed_name
         1837966346          3786908893  51.5% /tmp/stats
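The wrapped figure is easy to check: the gzip trailer (the ISIZE field in RFC 1952) stores the uncompressed size modulo 2^32, so for a 25261745373-byte stream gzip -l can only report the remainder:

```shell
#!/bin/sh
# gzip's trailer (ISIZE) holds the uncompressed size modulo 2^32,
# so `gzip -l` under-reports anything over 4 GiB:
echo $(( 25261745373 % 4294967296 ))  # 3786908893, the figure gzip -l shows
echo $(( 25261745373 / 4294967296 ))  # 5 full wraps around 2^32
```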
  • `less` is intended to be used interactively. It appears that its output is piped into awk here. Why are you even using `less`? – Kenster Apr 04 '15 at 18:53
  • just a "harmless" habit (not so harmless after today's revelation). With proper lesspipe setup - abstracts from any need of caring about whether it's gz/bz2/etc or a raw file (have plenty of IT-enforced routines for gzipping/bzipping everything-older-than-1d on background) – Vlad Apr 04 '15 at 19:00

1 Answer

less stores all data in memory. This is what allows you to scroll up.

that other guy
  • thanks, sounds reasonable. So `less` is not "smart enough" to see output going into a pipe (and not into a tty where scrollability is possible) and thus discard internal buffers? – Vlad Apr 04 '15 at 18:54
  • Apparently not. There may be options to configure these things, but if you don't want paging you shouldn't be using a pager. – that other guy Apr 04 '15 at 19:26
  • OK, thanks! Time to change my habits and use direct tools instead of the ones I used to use without noticing any overhead :) – Vlad Apr 04 '15 at 20:12