
The details of my query follow:

  1. I have a very large TSV (Tab Sep. Value) file (where large > 30 GB).
  2. I want to extract only those lines that don't end with an empty last field. Since this is a TSV file, that means the lines that don't end with \t\n, which is a trivial test and not the subject of this question. That filter removes some 75% of the lines right off the bat, reducing the workload.
  3. Then I want to extract a small subset of fields from the remaining lines. The fields are not contiguous, but they are few in number (e.g., let's say seven out of just over thirty in total). For example, say fields 2,3,12-18,25-28,31.
  4. The lines I am extracting from are very long, most around 1,000 characters, because they contain a large number of tab-delimited fields.

One option is to obviously use the following simple code, which I've tried to nicely format and include comments to show my reasoning:

use warnings;
use strict;
# I am using the latest stable version of Perl for this exercise
use 5.30.0;

while (<>)
{
  # Skip lines ending with an empty field
  next if substr($_,-2) eq "\t\n";

  # Remove "\n"
  chomp;

  # Split matching lines into fields on "\t", creating @fields
  my @fields=split(/\t/,$_);

  # Copy only the desired fields from @fields to create a new
  # line in TSV format
  # This can be done in one simple step in Perl, using
  # array slices and the join() function
  my $new_line=join("\t",@fields[2,3,12..18,25..28,31]);

  # ...
}

But using split leads to extra parsing (beyond the last field I need) and produces a complete array of fields, which I also don't need. I think it would be more efficient not to create the array, but to parse each line looking for tabs, counting field indexes as I go, building the output line along the way, and stopping at the last field I need.
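To make this concrete, here is a rough, untested sketch of the hand-rolled parser I have in mind (the %want hash and $last_needed variable are just illustrative names; the indices match the 0-based slice above):

use warnings;
use strict;

# 0-based indices of the wanted fields, matching the slice above
my @want_list = (2, 3, 12 .. 18, 25 .. 28, 31);
my %want = map { $_ => 1 } @want_list;
my $last_needed = $want_list[-1];

while (<>)
{
  # Same filter as above: skip lines ending with an empty field
  next if substr($_,-2) eq "\t\n";
  chomp;

  my @out;
  my ($field, $start) = (0, 0);
  while ($field <= $last_needed)
  {
    # Find the tab that terminates the current field; the last field
    # on a line has no trailing tab and runs to the end of the line
    my $tab = index($_, "\t", $start);
    my $end = $tab < 0 ? length($_) : $tab;

    push @out, substr($_, $start, $end - $start) if $want{$field};

    last if $tab < 0;     # no more fields on this line
    $start = $tab + 1;
    $field++;
  }

  my $new_line = join("\t", @out);

  # ...
}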

Am I correct in my assessment, or is just doing a simple split, followed by a join of the slices containing the fields of interest, the best way to go here from a performance perspective?

Update: Unfortunately, no one mentioned the possibility of using GNU cut for the split and piping the results into Perl for the rest of the processing. This is probably the most performant way, short of writing lots of custom (C) code to do this or resorting to large block-based reads with custom line parsing (also in C).
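For example, something along these lines (untested; the file name is a placeholder, and the grep stage does the empty-last-field filtering while the original last field is still present, before cut discards the unselected columns):

grep -Pv '\t$' yourFile.tsv | cut -f2,3,12-18,25-28,31 | perl -ne '
    chomp;
    # ... the remaining per-line processing, on the already-reduced records ...
'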

Michael Goldshteyn
  • Crossposted to [PerlMonks](https://www.perlmonks.org/?node_id=11101571). – choroba Jun 19 '19 at 15:22
  • I would use `split` unless benchmarking indicated it was a bottleneck. Or a CSV parsing module. – Shawn Jun 19 '19 at 15:59
  • split takes forever on a file this large. I was hoping that there is perhaps a faster alternative approach to cut down on the wall time to process the file. – Michael Goldshteyn Jun 19 '19 at 16:17
  • How do you know it's split that takes forever? How long does it take just to read the file line by line? – choroba Jun 19 '19 at 16:25
  • @choroba, that part is very fast, as is the filtering for lines ending in empty fields. This is a very powerful machine, but the unnecessary creation of arrays and parsing with split add enough overhead to make it slow. – Michael Goldshteyn Jun 19 '19 at 16:38
  • The IO is the bottleneck. You might be able to speed that up by reading a big chunk using `sysread` instead of reading a line at a time. (I say "might" because you now have the extra task of extracting the lines; see the sketch after these comments.) – ikegami Jun 19 '19 at 18:35
  • @ikegami if there are 200 fields in a line and 8 million split calls, all those extra SVs being created can make quite a difference. on my machine, limiting the split saves 3 minutes – ysth Jun 19 '19 at 22:26
  • @ysth I find the same -- to the point that `Text::CSV` which I tried is an order of magnitude slower (x20) in a benchmark (wrote a tsv file with 260 fields per line). I tried `->fragment` in newer `Text::CSV`, which does give me back only wanted columns -- but is even slower – zdim Jun 20 '19 at 00:14
  • @ysth (I mean, I think `Text::CSV` is slower only because it has to take the whole lines. Whenever I timed it against manual splitting it was 50%-ish-or-so faster.) – zdim Jun 20 '19 at 00:20
  • @zdim, Why would you ever use Text::CSV over Text::CSV_XS? – ikegami Jun 20 '19 at 04:01
  • @ysth, The OP said there were only 30 fields (though they said they wanted the 31st) – ikegami Jun 20 '19 at 04:03
  • @ikegami oups, yes, I meant `Text::CSV_XS` (It's _always_ set up but I always use the wrapper `use` statement). That's a good way to put it, I wouldn't _ever_ want to :)) – zdim Jun 20 '19 at 04:03
  • @ikegami "_OP said there were only 30 fields_" -- I think they meant that they are picking out of the first 30 or so but there's ... a few hundred? – zdim Jun 20 '19 at 04:05
  • @zdim, They said they were picking about 7 out of 30 – ikegami Jun 20 '19 at 04:05
  • @ikegami for some reason I thought ~100s ... looking for it ... – zdim Jun 20 '19 at 04:07
  • @zdim, And on PM, they also said they wanted one of the last fields. – ikegami Jun 20 '19 at 04:08
  • That leaves: Reducing IO times, reusing scalars, and a custom parser (though `split` is quite fast) – ikegami Jun 20 '19 at 04:08
  • OP here. There are 34 fields, I misspoke when I said thirty. – Michael Goldshteyn Jun 21 '19 at 18:34
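For reference, a rough, untested sketch of the chunked-read idea from ikegami's comment (the 1 MB block size and variable names are arbitrary): read large blocks with sysread, carry the trailing partial line over to the next block, and only then split into lines.

use warnings;
use strict;

my $buf = '';
# Reads from STDIN, e.g. when the big file is piped in
while (sysread(STDIN, $buf, 1_048_576, length $buf))
{
  # Process only complete lines; keep the trailing partial line in $buf
  my $nl = rindex($buf, "\n");
  next if $nl < 0;
  my $chunk = substr($buf, 0, $nl + 1, '');   # 4-arg substr: extract and remove

  for my $line (split /\n/, $chunk)
  {
    next if substr($line, -1) eq "\t";        # empty last field
    # ... field extraction as in the question ...
  }
}
# Note: a final line with no trailing "\n" would still be left in $buf here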

2 Answers

5

You can tell split when to stop with its limit parameter:

my @fields=split(/\t/,$_,33);

(Specify one more than the number of fields you actually want, because the last field it produces will contain the remainder of the line.)
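In the question's loop that would look something like this (the slice itself is unchanged: its highest index, 31, is still a fully split field, and only element 32 would hold the unsplit remainder of the line):

my @fields=split(/\t/,$_,33);   # at most 33 elements; element 32 holds the rest of the line
my $new_line=join("\t",@fields[2,3,12..18,25..28,31]);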

ysth
  • I'll give you the checkmark, for now, but I was hoping to see alternative more performant approaches to split. – Michael Goldshteyn Jun 21 '19 at 18:34
  • if there are only 34 fields, this is not going to make a significant difference. a regex match might give some improvement. So you have lines that are about 1000 characters and 34 fields in each line and maybe 30 million lines, of which 75% are ignored because of the empty field at the end? – ysth Jun 21 '19 at 20:05
0
grep -P -v "\t\s*$" yourFile.tsv | cut -f2,3,12-18,25-28,31

You don't even have to write any Perl code for this.

Here,

-P enables Perl-compatible regular expressions, which provide more functionality than plain grep patterns (it is what makes \t and \s work here).

-v inverts the match, which corresponds to your next if test.

BTW, if you have enough cores and memory, you might want to speed up the process with a split-and-merge approach:

split -n l/10 -d yourFile.tsv yourFile.tsv.

That will generate yourFile.tsv.00, ..., yourFile.tsv.09 (the l/10 form splits into 10 chunks on line boundaries, so no record is cut in half).

Thus, the whole thing looks something like the block below:

`split -n l/10 -d yourFile.tsv yourFile.tsv.`;
# Build one shell command so that "wait" can see all the background jobs
my $cmd = "";
foreach my $file (glob "yourFile.tsv.*") {
      # the \$ stops Perl from interpolating $" inside the qq{} string
      $cmd .= qq{grep -P -v "\\t\\s*\$" $file | cut -f2,3,12-18,25-28,31 > $file.filtered & };
}
system "$cmd wait";   # run the filters in parallel, then wait for them all
`cat yourFile.tsv.*.filtered > final.output.tsv`;
JUNPA