How can I extract some data out of the middle of a noisy file using Perl 6?

Question

I would like to do this using idiomatic Perl 6.

I found a wonderful contiguous chunk of data buried in a noisy output file.

I would like to simply print out the header line starting with Cluster Unique and all of the lines following it, up to, but not including, the first occurrence of an empty line. Here's what the file looks like:

</path/to/projects/projectname/ParameterSweep/1000.1.7.dir> was used as the working directory.
....

Cluster Unique Sequences    Reads   RPM
1   31  3539    3539
2   25  2797    2797
3   17  1679    1679
4   21  1636    1636
5   14  1568    1568
6   13  1548    1548
7   7   1439    1439

Input file: "../../filename.count.fa"
...

Here's what I want parsed out:

Cluster Unique Sequences    Reads   RPM
1   31  3539    3539
2   25  2797    2797
3   17  1679    1679
4   21  1636    1636
5   14  1568    1568
6   13  1548    1548
7   7   1439    1439

Christopher Bottoms · Answer 1 · 2015-03-21T14:35:03.590

One-liner version

.say if /Cluster \s+ Unique/ ff^ /^\s*$/ for lines;

In English

Print every line from the input file, starting with the one containing the phrase Cluster Unique and ending just before the next empty line.

Same code with comments

.say                    # print the default variable $_
if                      # do the previous action (.say) "if" the following term is true
/Cluster \s+ Unique/    # Match $_ if it contains "Cluster Unique"
ff^                     # Flip-flop operator: false until the preceding term becomes true,
                        #                     then true until the term after it becomes true
                        #                     (the ^ makes it false on that final line too)
/^\s*$/                 # Match $_ if it is empty or contains only whitespace
for                     # Create a loop placing each element of the following list into $_
lines                   # Create a list of all of the lines in the file
;                       # End of statement

Expanded version

for lines() {
    .say if (
        $_ ~~ /Cluster \s+ Unique/  ff^  $_ ~~ /^\s*$/
    )
}

lines() is like <> in perl5. Each line from each file listed on the command line is read in one at a time. Since this is in a for loop, each line is placed in the default variable $_.
say is like print except that it also appends a newline. When written with a starting ., it acts directly on the default variable $_.
$_ is the default variable, which in this case contains one line from the file.
~~ is the match operator that is comparing $_ with a regular expression.
/ / delimits a regular expression: the pattern goes between the two forward slashes.
\s+ matches one or more whitespace characters
ff is the flip-flop operator. It is false as long as the expression to its left is false. It becomes true when the expression to its left evaluates as true, and it becomes false again once the expression to its right evaluates as true; it then stays false unless the left expression matches again. In this case, if we used ^ff^ instead of ff^, then the header would not be included in the output.
When ^ comes before (or after) ff, it modifies ff so that it is also false on the iteration in which the expression to its left (or right) becomes true.
/^\s*$/ matches an empty line (or one containing only whitespace)
- ^ matches the beginning of a string
- \s* matches zero or more spaces
- $ matches the end of a string

By the way, the flip-flop operator in Perl 5 is .. when it is in a scalar context (it's the range operator in list context). But its features are not quite as rich as in Perl 6, of course.
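
To make the four variants described above concrete, here is a minimal sketch that runs each of them over a small in-memory list. The sample data and the array names (@lines, @both, @no-end, @no-start, @neither) are made up for illustration; only ff and its ^ forms come from the answer itself.

```raku
# Hypothetical sample data standing in for the noisy file.
my @lines = 'noise', '', 'Cluster Unique Sequences', '1 31', '2 25', '', 'Input file';

my (@both, @no-end, @no-start, @neither);

for @lines {
    # Each flip-flop below keeps its own on/off state across iterations.
    @both.push($_)     if /Cluster \s+ Unique/ ff   /^ \s* $/;  # header through the blank line
    @no-end.push($_)   if /Cluster \s+ Unique/ ff^  /^ \s* $/;  # header, but not the blank line
    @no-start.push($_) if /Cluster \s+ Unique/ ^ff  /^ \s* $/;  # blank line, but not the header
    @neither.push($_)  if /Cluster \s+ Unique/ ^ff^ /^ \s* $/;  # neither header nor blank line
}

say "ff   kept {@both.elems} lines";
say "ff^  kept {@no-end.elems} lines";
say "^ff  kept {@no-start.elems} lines";
say "^ff^ kept {@neither.elems} lines";
```

Only the ff^ form yields exactly the header plus the data rows, which is why the one-liner uses it.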

your one-liner uses bare `say` instead of `.say`; you can also get rid of some more parens by writing it as `.say if /Cluster \s+ Unique/ ff^ /^\s*$/ for lines` — Christoph, Mar 21 '15 at 13:36
It's just a minor improvement over your answer, so I do not think it deserves its own one — Christoph, Mar 21 '15 at 13:46

7stud · Accepted Answer · 2015-03-23T21:29:27.133

I would like to do this using idiomatic Perl 6.

In Perl, the idiomatic way to locate a chunk in a file is to read the file in paragraph mode, then stop reading the file when you find the chunk you are interested in. If you are reading a 10GB file, and the chunk is found at the top of the file, it's inefficient to continue reading the rest of the file--much less perform an if test on every line in the file.

In Perl 6, you can read a paragraph at a time like this:

my $fname = 'data.txt';

my $infile = open(
    $fname, 
    nl => "\n\n",   #Set what perl considers the end of a line.
);  #Removed die() per Brad Gilbert's comment. 

for $infile.lines() -> $para {  
    if $para ~~ /^ 'Cluster Unique'/ {
        say $para.chomp;
        last;   #Quit reading the file.
    }
}

$infile.close;

#    ^                   Match start of string.
#   'Cluster Unique'     By default, whitespace is insignificant in a perl6 regex. Quotes are one way to make whitespace significant.

However, in Perl 6 (Rakudo on MoarVM) the open() function does not currently handle the nl argument correctly, so you can't yet set paragraph mode.
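
The early-exit idea itself does not depend on paragraph mode. The following is only a sketch, not the approach this answer recommends: it walks the input one line at a time and stops at the first blank line after the header. The in-memory @lines, the $in-chunk flag and the @chunk array are hypothetical names used so the example runs on its own; for a real file you could iterate over 'data.txt'.IO.lines instead, which reads lazily, so last stops the reading early.

```raku
# Hypothetical sample data; replace @lines with 'data.txt'.IO.lines for a real file.
my @lines = 'noise', '', 'Cluster Unique Sequences', '1 31', '2 25', '', 'Input file';

my @chunk;
my $in-chunk = False;           # becomes True once the header has been seen

for @lines -> $line {
    if $in-chunk {
        last if $line ~~ /^ \s* $/;     # first blank line after the header: stop reading
        @chunk.push: $line;
    }
    elsif $line ~~ /^ 'Cluster Unique'/ {
        $in-chunk = True;               # header found: start collecting
        @chunk.push: $line;
    }
}

.say for @chunk;
```

This still tests every line up to the end of the chunk, but nothing after it is read.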

Also, there are certain idioms that are considered by some to be bad practice, like:

Postfix if statements, e.g. say 'hello' if $y == 0.
Relying on the implicit $_ variable in your code, e.g. .say

So, depending on which side of the fence you live on, using those idioms could be considered bad practice in Perl.

You don't need the `or die` in Perl 6, and it is rather pointless as it will never get run. — Brad Gilbert, Mar 22 '15 at 22:20
@BradGilbert, I added that after doing some research and looking at some specs, but now I read that autodie is the default, which is nice. Still playing tennis? — 7stud, Mar 22 '15 at 23:32
@Christopher Bottoms, I don't consider slurping a 10GB file into memory a work around, so I deleted your edit. There are better ways to locate a chunk in a file. — 7stud, Mar 23 '15 at 21:31
