I have a huge tab-separated file with up to 200 million rows (normally around 20 million) and two columns: the first column contains an ASCII word of up to 40 characters, the second contains an integer.
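For illustration, a few (made-up) lines of the input would look like this:

```
housefly	12
aardvark	3
housefly	12
zebra	1003
```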
I would like to do the following steps:
- sort by first column
- delete duplicate rows to make all rows unique
- read out all rows for a given entry in the first column
I have a memory limit of 3 GB (so reading all the data into a hash won't work), unlimited hard disk space, and I want to run the script on a single core. I intend to run several such scripts in parallel, so the read and write load on the hard disk shouldn't be too high.
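To be explicit about what won't work: the obvious approach would be to slurp everything into a hash of arrays keyed by the first column, which I expect to need far more than 3 GB at these row counts (a rough sketch; the file name is a placeholder):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive in-memory approach (not feasible at this scale): a hash of arrays
# keyed by the first column.  Perl's per-key and per-element overhead means
# this needs many times the raw file size in RAM.
open my $in, '<', 'words.tsv' or die "open: $!";

my %rows;
while (my $line = <$in>) {
    chomp $line;
    my ($word, $num) = split /\t/, $line, 2;
    push @{ $rows{$word} }, $num;
}
```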
How should I proceed with the implementation of my script (in Perl), considering the size of the file?
Which algorithm do you recommend for the first step, considering the size of the file?
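My current idea for steps 1 and 2 is not to sort in Perl at all but to shell out to GNU sort, which does an external merge sort on disk and accepts a memory cap and a temporary directory; a sketch of what I mean (file names, buffer size and temp path are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $in  = 'words.tsv';          # placeholder input name
my $out = 'words.sorted.tsv';   # placeholder output name

# Byte-wise comparisons: faster, locale-independent, and consistent with
# Perl's own string comparison if I later binary-search the sorted file.
$ENV{LC_ALL} = 'C';

# -u drops duplicate lines after sorting; sorting whole lines also orders
# the file by the first column.  -S caps the in-memory buffer, -T points
# the temporary merge files at a disk with enough free space.
system('sort', '-u', '-S', '2G', '-T', '/path/to/tmp', '-o', $out, $in) == 0
    or die "sort failed: exit status $?";
```

The appeal is that an external merge sort keeps memory bounded and reads/writes mostly sequentially, which I hope would also keep the disk load friendly to the other scripts running in parallel, but I don't know whether this is the best choice here.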
Step 3 is the most complex part, I think. How should I handle it? I am not familiar with indexing algorithms. Could you suggest one that best fits the problem? Are there any Perl modules I could use?
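The only idea I have so far for step 3 is a binary search over the sorted text file, e.g. with the core module Search::Dict; if I understand its documentation correctly, look() positions the filehandle at the first line that is stringwise greater than or equal to the key, which (since the tab sorts before any word character) should be the first row for that word. A sketch of what I imagine, assuming the file was sorted with LC_ALL=C as above and that the module behaves the way I think it does:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Search::Dict;

my ($file, $word) = ('words.sorted.tsv', 'housefly');   # placeholders

open my $fh, '<', $file or die "open $file: $!";

# Binary search: position the handle at the first line >= $word.
look($fh, $word);

# The file is sorted, so all rows for $word are contiguous from here.
while (my $line = <$fh>) {
    my ($w) = split /\t/, $line, 2;
    last if $w gt $word;          # past the block of matching rows
    print $line if $w eq $word;
}
```

That would be O(log n) seeks per lookup without building a separate index, which is why it appeals to me, but I have no idea whether it holds up at this file size or whether a real index structure would be clearly better.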
Does it make sense to first convert the file into a binary file (like converting SAM to BAM)? If so, do you have any instructions or algorithms for converting and handling such files?
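To make that last question concrete, what I have in mind by a binary file is fixed-width records: 40 space-padded bytes for the word plus a 32-bit integer (assuming the integers fit in 32 bits), so that record $i sits at byte offset $i * 44 and the file can be binary-searched by record number. A rough sketch of the conversion I imagine (file names are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each record: space-padded word ('A40') + signed 32-bit integer ('l'),
# i.e. 44 bytes, so record $i starts at byte offset $i * 44.
my $TEMPLATE = 'A40 l';
my $RECLEN   = 44;

open my $in,  '<',     'words.sorted.tsv' or die "open: $!";
open my $out, '>:raw', 'words.bin'        or die "open: $!";

while (my $line = <$in>) {
    chomp $line;
    my ($word, $num) = split /\t/, $line, 2;
    print {$out} pack($TEMPLATE, $word, $num);
}
close $out or die "close: $!";

# Random access would then be:
#   seek $fh, $i * $RECLEN, 0;
#   read $fh, my $buf, $RECLEN;
#   my ($word, $num) = unpack $TEMPLATE, $buf;
```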