7

I need to construct the following flow:

  • Accept a list of file names
  • Extract multiple lines from those files
  • Process those lines

However, I have no idea how to properly inject gather-take into map:

sub MAIN ( *@file-names ) {

    @file-names.map( { slip parse-file( $_ ) } ).map( { process-line( $_ ) } );
}

sub parse-file ( $file-name ) {
    return gather for $file-name.IO.lines -> $line {
        take $line if $line ~~ /a/; # dummy example logic
    }
}

sub process-line ( $line ) {
    say $line;  # dummy example logic
}

This code works but leaks memory like crazy. I assume slip makes gather-take eager? Or does slip not mark Seq items as consumed? Is there a way to slip a gather-take result into map in a lazy manner?

BTW: My intent is to parallelize each step with race later, so that, for example, 2 files are parsed at the same time, producing lines for 10 line processors. Generally speaking, I'm trying to figure out the easiest way of composing such cascading flows. I've tried Channels to connect the processing steps, but they have no built-in pushback. If you have any other patterns for such flows, comments are more than welcome.
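For illustration, the final shape I'm after is roughly the sketch below. This is hypothetical, not something I have working; the degree/batch values are placeholders:

# hypothetical target shape - degree/batch values are placeholders
@file-names
    .race( degree => 2, batch => 1 )   # 2 files parsed at the same time
    .map( { slip parse-file( $_ ) } )
    .map( { process-line( $_ ) } );    # the 10 line processors would need their own degree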

EDIT 1:

I think my code is correct, and the memory leak is not caused by bad logic but rather by a bug in the Slip class. I've created issue https://github.com/rakudo/rakudo/issues/5138, which is currently open. I'll post an update once it is resolved.

EDIT 2: No, my code was not correct :) Check my answer below.

Pawel Pabian bbkr

2 Answers

6

I believe that you are mistaken about the cause of the non-laziness in your code: in general, using slip does not make code eager. And, indeed, when I run the slightly modified version of your code shown below:

sub MAIN () {
    my @file-names = "tmp-file000".."tmp-file009";
    spurt $_, ('a'..'z').join("\n") for @file-names;

    my $parsed = @file-names.map( { slip parse-file( $_ ) } );
    say "Reached line $?LINE";

    $parsed.map( { process-line( $_ ) } );
}

sub parse-file ( $file-name ) {
    say "processing $file-name...";
    gather for $file-name.IO.lines -> $line {
        take $line if $line ~~ /a/; # dummy example logic
    }
}

sub process-line ( $line ) {
    say $line;  # dummy example logic
}

I get output that shows Raku processing the files lazily (note that it does not call parse-file until it needs to pass a new value to process-line):

Reached line 8
processing tmp-file000...
a
processing tmp-file001...
a
processing tmp-file002...
a
processing tmp-file003...
a
processing tmp-file004...
a
processing tmp-file005...
a
processing tmp-file006...
a
processing tmp-file007...
a
processing tmp-file008...
a
processing tmp-file009...
a

Since I don't have the rest of your code, I'm not sure what is triggering the non-lazy behavior you're observing. In general, though, if you have code that is being eagerly evaluated when you want it to be lazy, the .lazy method and/or the lazy statement prefix are good tools.
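For example, here's a generic illustration of both (not code from your program):

# .lazy returns a lazily evaluated version of a sequence:
my $doubles = (1..Inf).map( { $_ * 2 } ).lazy;
say $doubles[^5];     # (2 4 6 8 10) - only computes what's needed

# and the `lazy` statement prefix guarantees the same for a statement's value:
my $multiples = lazy gather for 1..Inf { take $_ if $_ %% 3 };
say $multiples[^3];   # (3 6 9)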


Finally, a couple of minor notes about the code you posted that aren't relevant to your question but that might be helpful:

  1. All Raku functions return the value of their final expression, so the return statement in parse-file isn't necessary (and it's actually slightly slower/non-idiomatic).
  2. A big part of the power of gather/take is that they can cross function boundaries. That is, you can have a parse-file function that takes lines without needing to have the gather statement inside parse-file itself; you just need to call parse-file within the (dynamic) scope of a gather block. This feels like it might be helpful in solving the problem you're working on, though it's hard to be sure without more info; see the sketch below.
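Here's a quick sketch of that cross-boundary style (take-matching-lines is a hypothetical name, and its body is just your dummy logic):

# `take` works here even though there is no gather in sight,
# because gather/take pair up dynamically, not lexically:
sub take-matching-lines ( $file-name ) {
    for $file-name.IO.lines -> $line {
        take $line if $line ~~ /a/;
    }
}

# ...as long as the call happens inside a gather block:
my $lines = gather { take-matching-lines($_) for @file-names };
$lines.map( { process-line( $_ ) } );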
codesections
  • Thanks for the explanation. I have no idea why, on long files, I get results printed right away (meaning in a non-lazy manner, just as you observed) and at the same time a huge memory leak. But you've inspired me with this cross-boundary `gather`-`take` idea. Indeed, the file parser is a lines provider, not a lines consumer. Maybe I'll figure something out using this approach. – Pawel Pabian bbkr Dec 15 '22 at 23:33
  • Also: `gather for $file-name.IO.lines -> $line { take $line if $line ~~ /a/ }` is just a complicated way of doing `$file-name.IO.lines.map: -> $line { $line if $line ~~ /a/ }`, the latter being about 3x as fast. Whether that matters, of course, depends on the complexity of your condition. – Elizabeth Mattijsen Dec 16 '22 at 13:52
  • @ElizabethMattijsen and, in turn, `$file-name.IO.lines.map: -> $line { $line if $line ~~ /a/ }` is just a complicated way of doing `$file-name.IO.lines.grep(/a/)` :D – codesections Dec 16 '22 at 14:32
  • I think this is a Raku bug: https://github.com/rakudo/rakudo/issues/5138 . It looks like my code is correct, but slip leaks memory when used on a lazy sequence. – Pawel Pabian bbkr Dec 16 '22 at 17:54
  • As for the simplification suggestions: @Elizabeth Mattijsen is right, but those are just dummy examples. My real file-parsing logic is way more complex. – Pawel Pabian bbkr Dec 16 '22 at 17:58
  • @codesections yes, indeed, but since @Pawel suggested some complicated checking logic, I kept the `.map` – Elizabeth Mattijsen Dec 16 '22 at 18:14
2

First of all, I had a big misconception. I thought that all lines produced by parse-file had to be slipped into the map block, like this:

@file-names.map( produce all lines here ).map( process all lines here );

And a Slip is a List that keeps all of its elements. That is why I had the big memory leak.
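You can see this in the type hierarchy (my own check, not part of the original code):

say Slip.^mro;   # ((Slip) (List) (Cool) (Any) (Mu))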

The solution is to create the gather-take sequence inside map but consume it outside of map:

@file-names.map( { parse-file( $_ ) } ).flat.map( { process-line( $_ ) } );

So now it is:

@file-names.map( construct sequence here ).(get items from sequence here).map( process all lines here );
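Put together, assuming the same parse-file and process-line subs from the question, MAIN becomes:

sub MAIN ( *@file-names ) {
    # each parse-file call returns a lazy Seq; .flat lazily
    # flattens those Seqs into a single stream of lines
    @file-names.map( { parse-file( $_ ) } )
               .flat
               .map( { process-line( $_ ) } );
}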
Pawel Pabian bbkr