
I have a process running on Solaris (SunOS m1001 5.10 sun4v sparc) and have been monitoring its total virtual memory usage.

Periodically running ps showed that the VSZ was growing linearly over time, in jumps of 80 kB, and that it keeps growing until it reaches the 4 GB limit, at which point the process runs out of address space and things start to fall apart.

while true; do ps -ef -o pid,vsz,rss|grep 27435 ; sleep 5; done > ps.txt

I suspected a memory leak and decided to investigate further with pmap. But pmap shows that the VSZ is not growing at all; it stays stable. All the file mappings, shared memory mappings and the heap also keep the same size.

while true; do pmap -x 27435 |grep total; sleep 5; done > pmap.txt

My first question is: Why do ps and pmap report a different VSZ for the same process?

I can imagine that heap sizes are calculated differently (e.g. heap usage vs. highest heap pointer), so I started thinking in the direction of heap fragmentation. I then used libumem and mdb to produce detailed reports about allocated memory at different times and noticed that there was absolutely no difference in allocated memory.

 mdb 27435 < $umem_cmds
 ::walk thread |::findstack !tee>>umemc-findstack.log
 ::umalog !tee>>umem-umalog.log
 ::umastat !tee>>umem-umastat.log
 ::umausers !tee>umem-umausers.log
 ::umem_cache !tee>>umem-umem_cache.log
 ::umem_log !tee>>umem-umem_log.log
 ::umem_status !tee>>umem-umem_status.log
 ::umem_malloc_dist !tee>>umem-umem_malloc_dist.log
 ::umem_malloc_info !tee>>umem-umem_malloc_info.log
 ::umem_verify !tee>>umem-umem_verify.log
 ::findleaks -dv !tee>>umem-findleaks.log
 ::vmem !tee>>umem-vmem.log
 *umem_oversize_arena::walk vmem_alloc | ::vmem_seg -v !tee>umem-oversize.log
 *umem_default_arena::walk vmem_alloc | ::vmem_seg -v !tee>umem-default.log

So my second question is: What is the best way to figure out what is causing the growing VSZ reported by ps?

  • What specifically do you mean by "fall apart"? Run the process under `truss` and see what system calls it's making to get its memory. – Andrew Henle Feb 24 '16 at 11:52
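For reference, a minimal sketch of what that could look like (the PID is the one from the question; the choice of brk/mmap/munmap is an assumption about which calls grow the address space):

 # Attach to the running process and log only the system calls
 # that change the size of the address space.
 truss -o truss-mem.log -t brk,mmap,munmap -p 27435

A steady stream of brk or anonymous mmap calls in the log would point at whatever code path is requesting the memory.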

2 Answers


I noticed that this question was still open and wanted to add how this story ended.

After a lot more digging I contacted Solaris customer support and sent them a way to reproduce the problem. They confirmed that there was a bug in the kernel that caused this behavior.

Unfortunately I cannot confirm that they rolled out a patch, since I have since left the company I was working for at the time.

Thx, Jef


If you run your suspect process with LD_PRELOAD=libumem.so, then at the point where "it all falls apart" you could gcore it, and then run mdb over the core with the umem dcmds such as ::findleaks -dv.
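A rough sketch of that workflow, reusing the PID from the question (./suspect_prog is a placeholder for the actual binary; UMEM_DEBUG and UMEM_LOGGING must be set so the transaction log that ::findleaks relies on is actually recorded):

 # Start the suspect process with libumem interposed and
 # transaction logging enabled.
 LD_PRELOAD=libumem.so UMEM_DEBUG=default UMEM_LOGGING=transaction ./suspect_prog &

 # When it starts to fall apart, take a core image without
 # killing the process, then run the umem dcmds over it.
 gcore 27435
 echo ::findleaks -dv | mdb core.27435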

If you look at all the mappings listed in the pmap(1) output rather than just the totals for the process, you'll have a much better idea of where to look. The first things I look for are the heap, anon and stack segments.
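For example (a sketch, again assuming PID 27435): take two full per-mapping snapshots some time apart and diff them, instead of comparing totals.

 # A growing heap, anon or stack segment shows up as a
 # changed line in the diff.
 pmap -x 27435 > pmap.before
 sleep 600
 pmap -x 27435 > pmap.after
 diff pmap.before pmap.after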

James McPherson
  • Thanks for your reply. ::findleaks -dv doesn't show any leaks. To be fair, I didn't wait until it falls apart around 3.9 GB, since it takes rather a long time to get there. The pmap -x reports stay stable over time for all mappings, not just the total, so that doesn't give me any clue. However, I do see the VSZ from ps growing linearly over time, by one page per increase by the looks of it (see the sampling sketch after these comments). So my question remains: what is the VSZ reported by ps including that is not monitored by pmap? – Jef de Busser Feb 29 '16 at 18:56
  • Did you look at the heap, anon and stack segment count and sizes? That should tell you a lot. – James McPherson Feb 29 '16 at 20:27
  • I dumped the output of pmap -x to files at two different moments in time. The files are binary identical. However, in the same period I can see the VSZ from ps going up by 1 page. – Jef de Busser Mar 01 '16 at 08:46
  • Same procedure for pfiles, same result. – Jef de Busser Mar 01 '16 at 13:00
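To quantify the one-page-per-interval growth described in the comments above, a small sampling sketch (PID assumed, interval arbitrary; plain date output is used for portability):

 # Log a timestamp and the VSZ (in kB) every 5 minutes; a linear
 # trend in the second column confirms the steady growth.
 while true; do echo "`date` `ps -o vsz= -p 27435`"; sleep 300; done > vsz-trend.txt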