
We have a Windows Server 2016 machine with around 700 GB of physical RAM. A colleague of mine ran a machine learning script in Matlab that loaded 25 GB of data into RAM; during training the RAM usage grew to about 350 GB (usual behavior for many AI algorithms during training). This caused a big performance drop for everyone else on the machine (including the colleague who started it). He tried to stop it by force-stopping the Matlab process tree (a single node) from the Task Manager.

Two hours later the process is still "stopping". We noticed that the RAM usage is gradually dropping, but only at around 200 KB/s. Restarting the machine is currently not an option.

Any idea what is going on here? Normally force-killing a process should bypass any graceful shutdown procedures; at least that is my experience.
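To be clear, by "killing" I mean a hard kill of the whole tree rather than a polite close request, roughly equivalent to the following (a sketch only; 1234 stands in for the actual MATLAB PID, we actually used Task Manager's "End process tree"):

```
# Roughly what Task Manager's "End process tree" does for the MATLAB parent.
# 1234 is a placeholder PID; /T terminates child processes, /F forces it.
taskkill /PID 1234 /T /F
```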

Update: a day later the Matlab process has actually increased its RAM usage to 357 GB.

rbaleksandar
  • “Normally killing a process should go past gentle shutdown procedures.” - The system has no memory to perform that operation since all physical and virtual memory was used by a single process. Seems strange it was able to use more than 5x the amount of memory that the system has. – Ramhound Apr 13 '23 at 13:27
  • It's Skynet emerging! Remove the power plug! ;-) It's probably a recursive problem. The machine learning process has created many sub-processes, which also create further sub-processes; killing them all at once is not possible, as the processes are unknown, so they are killed one after another while new processes are spawned at the same time. Next time use Matlab within a VM and kill that VM. – paladin Apr 13 '23 at 14:12
  • Unless the machine actually had 350GB of physical RAM it probably wrote all of that used memory out to the page file and ballooned the page file out to 350+ GB as well. Now it is having to read in every page of memory that got paged out, invalidate it and then release it from the page table stored in memory. Whoever administrates this server should set some kind of sensible upper limit to page file size. – Mokubai Apr 13 '23 at 15:43
  • The question doesn't seem to have been edited and it clearly states that the server actually has 700GB of RAM so the script consumed about half of that, no reason for the pagefile to be involved (from what is stated in the question). – Ginnungagap Apr 14 '23 at 06:28
  • It's a server. It has a lot of memory (700GB physical). I will add the info to the question so that there is no confusion. Also one day later the memory usage is still 49-50%. :D – rbaleksandar Apr 14 '23 at 08:18
  • How was the process killed? `taskkill /f /im:foo.exe` will force it, and may be quicker. How much swap is used? – vidarlo Apr 14 '23 at 08:52
  • The amount of memory is meaningless if it is fragmented. You would need a memory dump to analyze, but this is beyond your capability. `Restarting the machine is currently not possible`. Then use another server. – Greg Askew Apr 14 '23 at 09:05

1 Answer


You probably killed the main process without killing the entire process tree. I suggest using Process Explorer to find which processes are using the most RAM and then killing the entire process tree.
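If you prefer the command line to Process Explorer, something along these lines should do the same job (a rough sketch; 1234 is a placeholder for whatever PID you identify as the MATLAB parent):

```
# List the top memory consumers (working set in GB), then kill the chosen tree.
Get-Process | Sort-Object WorkingSet64 -Descending |
    Select-Object -First 10 Id, ProcessName, @{n='WorkingSetGB'; e={[math]::Round($_.WorkingSet64 / 1GB, 1)}}

# /T terminates the whole child tree, /F forces termination without cleanup.
taskkill /PID 1234 /T /F
```

Note that a process stuck in the "stopping" state (for example, blocked on a kernel call or a driver) may ignore even a forced kill until whatever it is waiting on completes.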

shodanshok
  • Actually I went for the whole process tree (as stated in the question), even though the tree contained just a single node. I also tried Process Explorer. The process currently cannot be stopped: Windows refuses to do so with the respective error message that usually appears when you try to stop something that is already in the process of stopping. – rbaleksandar Apr 14 '23 at 12:04
  • Can you share the precise error message (or a screenshot)? – shodanshok Apr 14 '23 at 13:16