I deal a lot with text files, comparing one to another in a "SQL manner". DBD::CSV is obviously a good choice to start with, as it lets me use the power of SQL syntax on text tables. However, I deal with huge text files, which makes DBD::CSV useless in terms of performance.
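For context, this is roughly how I use it today (a sketch; the directory, file and column names are made up):

    use strict;
    use warnings;
    use DBI;

    # Hypothetical setup: every *.csv file in ./data is exposed as a table.
    my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
        f_dir        => './data',
        f_ext        => '.csv/r',
        csv_sep_char => ',',
        RaiseError   => 1,
    });

    # Plain SQL over the text files -- convenient, but slow on big inputs.
    my $sth = $dbh->prepare(q{
        SELECT old_file.id, old_file.name
        FROM   old_file
        INNER JOIN new_file ON old_file.id = new_file.id
        WHERE  old_file.value <> new_file.value
    });
    $sth->execute;
    while (my $row = $sth->fetch) {
        print join("\t", @$row), "\n";
    }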
So I started writing a module that converts CSV files to an SQLite DB and then returns a DBI database handle (via DBD::SQLite) that I can play with. The thing is, converting a text file to an SQLite table can also be inefficient, because I cannot run the sqlite3 command line from Perl to load CSV files quickly (using .import). So I have to build one huge INSERT INTO statement from the text tables and execute it (executing an INSERT per line is very inefficient, so I preferred one big insert). I would like to avoid that, and I'm looking for a one-liner to load a CSV file into SQLite from Perl.
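This is roughly what my current loader looks like (a sketch, assuming Text::CSV_XS and DBD::SQLite; the table, column and file names are made up):

    use strict;
    use warnings;
    use DBI;
    use Text::CSV_XS;

    # Read the CSV and build one big multi-row INSERT, then run it in a single do().
    my $dbh = DBI->connect('dbi:SQLite:dbname=tmp.db', '', '', { RaiseError => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS data (id INTEGER, name TEXT, value TEXT)');

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<', 'data.csv' or die "data.csv: $!";

    my @values;
    while (my $row = $csv->getline($fh)) {
        push @values, '(' . join(',', map { $dbh->quote($_) } @$row) . ')';
    }
    close $fh;

    # Multi-row VALUES needs a reasonably recent SQLite (3.7.11+),
    # which current DBD::SQLite bundles.
    $dbh->do('INSERT INTO data (id, name, value) VALUES ' . join(',', @values))
        if @values;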
And another thing: I use the following subs to execute a SQL query and print the result nicely:
sub sql_command {
    my ($self, $str) = @_;

    # DBI errors go to errstr, not $!, so report those instead.
    my $s = $self->{_db}->prepare($str) or die $self->{_db}->errstr;
    $s->execute() or die $s->errstr;

    # First row holds the column names, the rest hold the data.
    my $table;
    push @$table, [ map { defined $_ ? $_ : "undef" } @{ $s->{NAME} } ];
    while (my $row = $s->fetch) {
        push @$table, [ map { defined $_ ? $_ : "undef" } @$row ];
    }

    return box_format($table);
}
sub box_format {
    my $table  = shift;
    my $n_cols = scalar @{ $table->[0] };

    # Requires Text::Table to be loaded elsewhere in the module.
    my $tb = Text::Table->new(\'| ', '', (\' | ', '') x ($n_cols - 1), \' |+');
    $tb->load(@$table);

    my $rule = $tb->rule(qw/- +/);
    my @rows = $tb->body();
    return $rule, shift @rows, $rule, @rows, $rule
        if @rows;
}
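For context, the two subs above are methods of a small wrapper class, and a typical call looks roughly like this (simplified; the package name, connection and query are illustrative):

    use DBI;

    # Assume the subs above live in a package called MyQueryTool and
    # tmp.db already contains a table "data".
    my $self = bless {
        _db => DBI->connect('dbi:SQLite:dbname=tmp.db', '', '', { RaiseError => 1 }),
    }, 'MyQueryTool';

    print $self->sql_command('SELECT id, name, value FROM data LIMIT 10');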
The sql_command sub takes about a minute to run on a 6.5 MB file, which in my opinion is far longer than it should be. Does anyone have a more efficient solution?
Thanks!