perl regex using too much memory?

Question

I have a Perl routine that frequently causes "out of memory" errors on the system.

The script does 3 things:

1> Capture the output of a command in an array (@arr = `$command`; the array holds about 13 MB of data after the command runs).
2> Match the contents of individual array elements against a large regex.

The regex is something like this:
if ($new_element =~ m|([A-Z0-9\-\._\$]+);\d+\s+([0-9]+)-([A-Z][A-Z][A-Z])-([0-9][0-9][0-9][0-9])\s+([0-9]+)\:([0-9]+)\:([0-9]+)|io)
<put to hash>
3> Put the array elements in a persistent hash map.
$hash_var{$arr[0]} = "Some value"

edit: Sample data processed by regex are

Z4:[newuser.newdir]TESTOPEN_ERROR.COM;4
                                                    8-APR-2014 11:14:12.58
Z4:[newuser.newdir]TEST_BOC.CFG;5
                                                    5-APR-2014 10:43:11.70
Z4:[newuser.newdir]TEST_BOC.COM;20
                                                    5-APR-2014 10:41:01.63
Z4:[newuser.newdir]TEST_NEWRT.COM;17
                                                    4-APR-2014 10:30:56.11

About 10000 lines like these

I started by suspecting that the array and hash together were consuming too much memory. However, I have started to think the regex might have something to do with the out-of-memory errors as well.

Is the Perl regex (with the 'io' options!) really the main culprit behind the out-of-memory errors?

Have you tried using a debugger to see how far you get before the out of memory error? And could you provide example data and what you have tried? — hwnd, Apr 30 '14 at 02:55
I am using an OpenVMS system with a 32-bit Perl image. I am not aware of any debugger other than Devel::Size. Open to suggestions. Related question: http://stackoverflow.com/questions/23354220/perl-encounters-out-of-memory-in-openvms-system — kbang, Apr 30 '14 at 03:03
I doubt it's regex, rather your 3>. 10000 unique lines as hash keys, plus "Some value(s)" may be costly. If your lines are not unique, you'll have overlap. Not sure what you really want — cur4so, Apr 30 '14 at 03:47
What is your purpose in using that regular expression? Extract information? Verify input data? Or something else? And why do you believe that the regular expression is the cause of your problem? — Lee Duhem, Apr 30 '14 at 03:48
I ran some tests on the size of array and hash. Turns out the max size of array is 13 MB and hash is less than 7 MB (even though scope of hash is outside the function). The purpose of the regex is to input only valid info into the hash. The "Some value" is actually the date in input data processed into posix format. Input to hash is mostly unique as keys are filenames — kbang, Apr 30 '14 at 04:06

score 1 · Accepted Answer · answered Apr 30 '14 at 10:47

This has nothing to do with regexes.

If you are operating in a memory-constrained environment, you should process data records one at a time rather than fetching all of them at once. Let's assume you pull your data like:

my @data = `some command`;
for my $line (@data) {
    ... # process the line
}

This is incredibly wasteful because you need storage for the data, and for the output of your processing (in your case: the hash).

Instead, process the input line by line. We can use the open function instead of backticks for this:

open my $cmd, '-|', 'some', 'command' or die "Can't run some command: $!";
while (my $line = <$cmd>) {
    ... # process the line
}

There is no need for an array here, which saves 13 MB of memory that can now be put to other use.
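A minimal sketch of the same loop with the matching and the hash insert filled in. The sample lines are copied from the question, and an in-memory filehandle stands in for the command pipe so the example runs anywhere. The names %mtime, $name_re, $date_re and $pending are hypothetical, and the code assumes each record spans two lines (a name line followed by a date line), as the sample data suggests.

```perl
use strict;
use warnings;

# Sample lines copied from the question. In real use this handle would be
# the pipe opened with open(my $cmd, '-|', ...) as shown above.
my $sample = <<'END';
Z4:[newuser.newdir]TESTOPEN_ERROR.COM;4
                                                    8-APR-2014 11:14:12.58
Z4:[newuser.newdir]TEST_BOC.CFG;5
                                                    5-APR-2014 10:43:11.70
END

open my $in, '<', \$sample or die "Can't open in-memory handle: $!";

# Compile the patterns once, outside the loop.
my $name_re = qr/^(\S+);\d+\s*$/;
my $date_re = qr/^\s+(\d+)-([A-Z]{3})-(\d{4})\s+(\d+):(\d+):(\d+)/;

my %mtime;      # file name (without version) => date string
my $pending;    # file name still waiting for its date line

while (my $line = <$in>) {
    if ($line =~ $name_re) {
        $pending = $1;
    }
    elsif (defined $pending && $line =~ $date_re) {
        # Only one line is held in memory at a time; only the result is kept.
        $mtime{$pending} = sprintf '%04d-%s-%02d %02d:%02d:%02d',
            $3, $2, $1, $4, $5, $6;
        undef $pending;
    }
}
close $in or die "Couldn't close handle: $!";

printf "%s => %s\n", $_, $mtime{$_} for sort keys %mtime;
```

Peak memory is then roughly the size of the hash plus one input line, rather than the hash plus the whole 13 MB array.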

amon

I have done this already and put it to the test. Waiting to see if it hits out of memory again, so that I can be sure it's not the array or hash. – kbang Apr 30 '14 at 11:45
I used $PIPE in open. Is it necessary to close it after my operations? e.g.: close $PIPE or die ("Not able to close") if <$PIPE>; – kbang May 02 '14 at 19:54
@KiranBangalore yes, pipes should be closed manually: `close $cmd or die "Couldn't close some command: $!"`. See also the `$?` and `${^CHILD_ERROR_NATIVE}` variables which will contain the exit status of the command. – amon May 02 '14 at 20:29

score 0 · Answer 2 · edited May 23 '17 at 10:26

What problem are you really trying to solve? Use your words... not Perl.

Something like: "The script is picking apart the output from an OpenVMS DIRECTORY command, and the objective is to report the number of files and the oldest date, ordered by directory."

First question is WHY keep the array. Will the script 'walk' it again? If not, just process each line there and then in a for loop.

The regex seems to pick out a file name and a date. That's been done before. It is not hard, and can be simplified by trusting the OpenVMS directory format. Something like this reads better imho:

if($new_element =~ m|](.*);\d+\s+(\d+)-(\w+)-(\d+)\s+(\d+):(\d+):(\d+)|)

: $hash_var{arr[0]} =

Hmmm, that suggests to me that a whole line from the array is used as the key, with all 50+ spaces. So those 10,000 lines turn into 1,000,000+ bytes just for raw key bytes. A lot, but not crazy. Now, we know that the first word on the line MUST be unique, so why not exploit that: $hash_var{$1} = xxx if /(\S+)/;

The program may also want to exploit that the leading strings are highly repetitive, and substitute everything before the "]" with an ever increasing directory number, maintained in a 'look-a-side' array and/or hash.
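One possible shape for that look-aside table, as a rough sketch rather than a definitive implementation. The helper names compact_key and expand_key, the tables %dir_id and @dirs, and the "number:filename" key layout are all hypothetical. The sketch assumes every path has its directory part ending in "]".

```perl
use strict;
use warnings;

my (%dir_id, @dirs);   # look-aside tables: directory string <=> small integer
my %hash_var;          # compact key => value

# Replace everything up to and including "]" with a directory number.
sub compact_key {
    my ($path) = @_;
    my ($dir, $file) = $path =~ /^(.*\])(.*)$/
        or return $path;               # no directory part: keep the key as is
    my $id = $dir_id{$dir};
    if (!defined $id) {
        push @dirs, $dir;              # first time this directory is seen
        $id = $dir_id{$dir} = $#dirs;
    }
    return "$id:$file";
}

# Rebuild the full path from a compact key.
sub expand_key {
    my ($key) = @_;
    my ($id, $file) = $key =~ /^(\d+):(.*)$/
        or return $key;
    return $dirs[$id] . $file;
}

my $key = compact_key('Z4:[newuser.newdir]TEST_BOC.CFG');
$hash_var{$key} = '5-APR-2014 10:43:11.70';
print "$key => ", expand_key($key), "\n";
```

Each directory string is stored once, however many files it contains, and the same compact_key call is needed again whenever the hash is looked up by full path.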

Personally I would drop /NOHEAD from the command, and use a regex to pick up the directories as they come by on their own lines.

Or use substr or whatever... of course you'd need to construct a similar key on re-access.

In the related topic, there is debugging output printed. Perhaps include the line number in the array for your own understanding?

Perl encounters "out of memory" in openvms system

Good luck! Hein
