
I'm a beginner in Perl, and I'm trying to process a large array. The contents of the array are the rows returned by executing this SQL query:

    select col1, col2, col3, col4 from my_table where rownum < 50000

while (my $h = $sth->fetchrow_hashref) {
 push (@data, $h);
}

for my $row (@data) {
 #process $row
}

Processing one element requires ~1 second, so processing 50k elements takes roughly 14 hours. So I thought it would be a better idea to store the rows first.

  1. Is fetchrow_hashref good? Someone told me to use fetchrow_arrayref (faster) or fetchrow_array (fastest)

  2. I have to process each $row only once. Shall I process the items as

    for my $row (@data) {
     #process $row
     shift @data;
    }
    
  3. Since the array is quite large, I don't think accessing it in list context is a good idea. Is the following better:

    while (@data) {
     #process $row
     shift @data;
    }
    
GrSrv
  • Don't build the `@data` array; use the first while loop to do the processing you want on `$h`. Points 2 and 3 are either incorrect or don't make sense. – mpapec Dec 17 '13 at 07:31
  • Processing one element requires ~1 second, so processing 50k takes roughly 14 hours. That's why I thought it would be a better idea to store it first. – GrSrv Dec 18 '13 at 10:02

3 Answers


If you have to process each row only once, and do not need to refer to previous rows while processing a given row, then do not load the entire array into memory. Process the row in the same loop that reads the rows. There's no need to take up the memory to store data you no longer need.
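A minimal sketch of this streaming approach (the in-memory SQLite database, tiny table, and `process_row` below are made-up stand-ins for illustration; DBD::SQLite is assumed to be installed):

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database stands in for the real one (assumption for the demo).
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE my_table (col1 INTEGER, col2 TEXT)');
$dbh->do('INSERT INTO my_table VALUES (?, ?)', undef, $_, "row$_") for 1 .. 5;

# Hypothetical per-row work; replace with your own processing.
our $processed = 0;
sub process_row {
    my ($row) = @_;
    $processed++;
}

my $sth = $dbh->prepare('SELECT col1, col2 FROM my_table');
$sth->execute;

# Process each row as it is fetched -- nothing accumulates in memory.
while (my $h = $sth->fetchrow_hashref) {
    process_row($h);
}
```

The memory footprint stays at one row, no matter how many rows the query returns.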

Jim Garrison
  • Processing one element requires ~1 second, so processing 50k takes roughly 14 hours. That's why I thought it would be a better idea to store it first. – GrSrv Dec 18 '13 at 09:48
  • What are you doing that takes 1 second per row? Is that processing CPU-bound or does it do other I/O or database access? – Jim Garrison Dec 18 '13 at 18:26
  • few db access + occasional cURl/LWP – GrSrv Dec 19 '13 at 06:10

If you are using this code:

while (my $h = $sth->fetchrow_hashref) {
 push (@data, $h);
}

for my $row (@data) {
 #process $row
}

then you can simply have DBI pull all the data for you with selectall_arrayref, which does the same work as your hand-written loop.

This code:

my %attrib = ('Slice' => {});
my $refArray = $refDB->selectall_arrayref($SQL, \%attrib, @binders);

Will return $refArray with all your data.

So a complete code will be:

my %attrib = ('Slice' => {});
my $refArray = $refDB->selectall_arrayref($SQL, \%attrib, @binders);
my @Array;
if (defined $refArray) {
 @Array = @{$refArray};
}

foreach my $ptrItem (@Array) {
 my %Item = %{$ptrItem};
 ...
}

I don't think this is heavier than pushing fetchrow_hashref results into an array yourself. It will consume more memory than calling fetchrow_hashref and handling each row on its own, though, because the fetchrow methods use a database-side cursor (e.g. a MySQL cursor) to keep track of where they are, rather than pulling all the data into the Perl code at once.
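If memory becomes the concern, DBI also offers a middle ground between row-at-a-time fetching and slurping everything: `fetchall_arrayref` accepts an optional row limit as its second argument, so you can work in bounded batches. A sketch (the in-memory SQLite table and the batch size of 3 are made up for illustration):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE my_table (col1 INTEGER)');
$dbh->do('INSERT INTO my_table VALUES (?)', undef, $_) for 1 .. 10;

my $sth = $dbh->prepare('SELECT col1 FROM my_table');
$sth->execute;

our $total = 0;
# Fetch at most 3 rows per call: memory stays bounded, but there are
# fewer fetch calls than with row-at-a-time processing.
while (my $batch = $sth->fetchall_arrayref({}, 3)) {
    last unless @$batch;                 # empty batch means we are done
    $total += $_->{col1} for @$batch;    # {} slice gives hashrefs per row
}
```

This keeps the readable hashref access while only ever holding one batch in memory.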

Noam Rathaus
  • Why make a copy of the 50k rows and put it into `@Array`? You can save the `if (defined...` completely by just going `foreach my $ptrItem (@{ $refArray })` and `my $refArray = $refDB->foo // []`. – simbabque Dec 17 '13 at 14:12
  • I like code to be clear; just because Perl has shortcuts doesn't mean they are clear to everyone. – Noam Rathaus Dec 17 '13 at 14:19

Let's see.

Is fetchrow_hashref good? Someone told me to use fetchrow_arrayref (faster) or fetchrow_array (fastest)

Yes, because you can see the column names in your code. That will save you time later when doing maintenance. We are only talking about 50k rows; that is not many, assuming this does not run twice every minute. Please take a look at this question: When to use $sth->fetchrow_hashref, $sth->fetchrow_arrayref and $sth->fetchrow_array?
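For comparison, here are the hashref and arrayref styles side by side (an illustrative sketch against a made-up in-memory SQLite table; the speed difference comes from not building a fresh hash for every row):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE t (col1 INTEGER, col2 TEXT)');
$dbh->do('INSERT INTO t VALUES (?, ?)', undef, $_, "name$_") for 1 .. 3;

our (@by_name, @by_pos);

# fetchrow_hashref: self-documenting access by column name.
my $sth = $dbh->prepare('SELECT col1, col2 FROM t');
$sth->execute;
while (my $row = $sth->fetchrow_hashref) {
    push @by_name, $row->{col2};
}

# fetchrow_arrayref: positional access, no hash built per row.
# Note: DBI reuses the same arrayref between fetches, so copy what you need.
$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {
    push @by_pos, $row->[1];
}
```

Both loops extract the same values; the hashref version just spells out which column you mean.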

I have to process each $row only once. Shall I process the items as

for my $row (@data) {
  #process $row
  shift @data;
}

As the others have already said, there is no point in saving all the data first and working with it later when you only need each row once. I would do it as follows, which saves you the overhead of creating a copy of the data.

while (my $res = $sth->fetchrow_hashref) {
  # process $res
  process($res);
}

Since the array is quite large, I don't think accessing it in list context is a good idea. Is the following better:

while (@data) {
  #process $row
  shift @data;
}

I'm not sure what you are talking about here. Perl is meant to deal with large arrays. In fact, 50k rows is not a lot (unless you keep BLOB fields with serialized 5MB jpegs in your DB). There is always more than one way to do it in Perl, and you need to pick the one that works best for you. That will save you time later.

Unless you feel it is slow. In that case, you should benchmark. There are multiple ways to do that; the easiest is the Benchmark module, which comes with good documentation. If that is not enough, take a look at Devel::NYTProf. You might also come across DBI profiling, but right now that is not what you want.
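As a starting point, here is a self-contained Benchmark sketch comparing hash-based and array-based row access on made-up data (core modules only; the timing table prints to STDOUT):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Fake result set: the same 1000 rows, once as hashrefs and once as arrayrefs.
my @hash_rows  = map { { col1 => $_, col2 => "name$_" } } 1 .. 1000;
my @array_rows = map { [ $_, "name$_" ] } 1 .. 1000;

# cmpthese runs each sub the given number of times and prints a comparison table.
cmpthese(2000, {
    hashref  => sub { my $n = 0; $n += $_->{col1} for @hash_rows;  },
    arrayref => sub { my $n = 0; $n += $_->[0]    for @array_rows; },
});
```

Swap the made-up rows for your real fetch code to see whether the fetch style actually dominates your runtime, or whether the per-row processing does.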

Also take a look at these really old slides by Tim Bunce, who built the DBI: http://cpansearch.perl.org/src/TIMB/DBI_AdvancedTalk_2004/sld017.htm

simbabque
  • Processing one element requires ~1 second, so processing 50k takes roughly 14 hours. That's why I thought it would be a better idea to store it first. – GrSrv Dec 18 '13 at 10:00