
I have inherited a Perl script that runs on an EC2 instance and crawls a bunch of URLs for data (i.e. scraping). The script is invoked by a shell script that forks multiple copies of it. There could be hundreds of these Perl processes running at any given point, depending on the scraping progress.

Each Perl script does this:

## create a temporary hash to hold everything ##
my %products = ();

and as you can imagine, that hash grows as more products are scraped within that process.

My question is this: what happens when Perl tries to add the next product to the %products hash and there isn't memory available? Does it simply wait, or does it die? My gut tells me it dies, but how can I use a malloc-style memory allocation where, if it can't allocate memory, it waits?

Is it better to just limit the number of child processes?

Any ideas would be greatly appreciated.

p.s. This is perl, v5.10.1 (*) built for i486-linux-gnu-thread-multi

Etienne

1 Answer


I'm not sure about the specifics of Perl, but in other dynamic languages such as Python you would get a memory allocation failure, and the program would crash unless it handled the error. Some languages (Python included) let you install a handler for that condition; Perl likely does something similar.

I'm not sure where you get the idea that malloc waits when it cannot allocate memory. The implementation on Linux either returns a valid pointer or returns NULL if the request fails.

The situation on Linux is complicated further by the fact that Linux allows memory overcommit by default. For example, if your system has 4 GB of virtual memory available, you can have multiple processes each allocate nearly 4 GB of memory. It isn't until they dirty the allocation (that is, write to it) that the memory is actually used. If multiple processes end up doing this, they will exhaust the memory that is actually available, and the kernel's out-of-memory (OOM) killer will kick in and kill some processes.
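If you want to check which overcommit policy your instance is using, something like the following should work. This is only a sketch: the function name overcommit_mode is made up for illustration, and it assumes a Linux kernel that exposes the setting under /proc.

```shell
#!/bin/sh
# overcommit_mode is a hypothetical helper name, not a standard command.
# It prints the kernel's overcommit policy, or "unknown" if /proc is absent:
#   0 = heuristic overcommit (the usual default)
#   1 = always overcommit
#   2 = strict accounting (allocations fail once the commit limit is reached)
overcommit_mode() {
    if [ -r /proc/sys/vm/overcommit_memory ]; then
        cat /proc/sys/vm/overcommit_memory
    else
        echo "unknown"
    fi
}
```

Calling overcommit_mode on a typical default setup would be expected to print 0, but that depends on how the machine was configured.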

The simple solution for you would be to just watch how much memory your processes use and only allow a certain number to run at a time. More complex solutions involve using fixed-length data structures so the memory usage is known, or streaming the results to disk either directly or via a buffer to keep the usage low. The solution really depends on the application and it's hard to propose something more concrete without details of its function.
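As a rough sketch of the "only allow a certain number to run at a time" approach, the launcher could hand its work list to xargs and let it cap the number of children. The names run_limited, MAX_JOBS, scrape.pl and urls.txt are placeholders I made up, and -P is an extension found in GNU and BSD xargs rather than a POSIX feature, so treat this as an assumption about your environment.

```shell
#!/bin/sh
# Hypothetical launcher helper: run at most MAX_JOBS children at once.
# MAX_JOBS defaults to 8 here; pick a value based on observed memory use.
MAX_JOBS=${MAX_JOBS:-8}

run_limited() {
    # Reads whitespace-separated items from stdin and runs "<command> <item>"
    # once per item, keeping at most MAX_JOBS processes alive at any time.
    # Note: xargs treats quotes and backslashes in its input specially, so
    # this assumes the items contain none of those characters.
    xargs -n 1 -P "$MAX_JOBS" "$@"
}

# Example (placeholder file names):
#   run_limited perl scrape.pl < urls.txt
```

With this shape, a new scraper starts only when an earlier one exits, so the peak memory use is bounded by MAX_JOBS times the largest per-process footprint.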

Kamil Kisiel
  • Thank you Kamil. This is most helpful. What I meant to say with 'malloc waits when it cannot allocate' was that one could easily write a routine to attempt to allocate memory with malloc, and if the pointer returned was NULL, the process could just wait until it was successful. But anyway, thank you for confirming my suspicions. – Etienne Mar 28 '11 at 22:43