
I am populating a table in MySQL from an XML file (containing more than a billion lines), using a Perl script to pick out the lines of interest. The script runs very smoothly up to about line 15M, but after that the time per chunk starts increasing, seemingly exponentially: the first 1,000,000 lines took ~12 s to parse and write to the database, but after 15M lines the same number of lines takes ~43 s.

I increased the innodb_buffer_pool_size from 128M to 1024M, as suggested at

Insertion speed slowdown as the table grows in mysql (answered by Eric Holmberg)

The time requirements came down to ~7 s and ~32 s respectively, but it is still slow, as I have a huge file to process and the time per chunk keeps increasing.
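
For reference, the change amounts to a single server setting; a minimal sketch, assuming it is made in my.cnf under the [mysqld] section (the value is the 1024M mentioned above, and older MySQL versions need a server restart before it takes effect):

    # my.cnf -- buffer pool sized as described above; adjust to the available RAM
    [mysqld]
    innodb_buffer_pool_size = 1024M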

I also removed the creation of any primary key and index, thinking they might be causing some of the problem (not sure, though).

Below is the code snippet:

use DBI;

$dbh = DBI->connect('dbi:mysql:dbname', 'user', 'password') or die "Connection Error: $DBI::errstr\n";

$stmt = "DROP TABLE IF EXISTS dbname";
$sth = $dbh->do($stmt);

$sql = "CREATE TABLE dbname(id INTEGER NOT NULL, type_entry VARCHAR(30) NOT NULL, entry VARCHAR(50))";
$sth = $dbh->prepare($sql);
$sth->execute or die "SQL Error: $DBI::errstr\n";
open my $fh1, '<', "file.xml" or die $!;

while (<$fh1>)
{
    if ($_=~ m/some pattern/g)
    {
        $_=~ s/some pattern//gi;
        $id = $_;
    }
    elsif ($_=~ m/some other pattern/)
    {
        $_=~ s/\s|some other pattern//gi;
        $type = $_;
    }
    elsif ($_=~ m/still some other pattern/)
    {
        $_=~ s/still some other pattern//gi;
        $entry = $_;
    }
    if($id ne "" && $type ne "" && $entry ne "")
    {
        $dbh->do('INSERT INTO dbname (id, type_entry, entry) VALUES (?, ?, ?)', undef, $id, $type, $entry);
    }
}

The database would contain around 1.7 million entries. What more can be done to reduce the time?

Thanks in Advance

EDIT 1:

Thank you all for the help. Since morning I have been trying to implement everything that was suggested and checking whether I get any significant reduction in time. So this is what I did:

  1. I removed matching the pattern twice, as suggested by @ikegami, but yes, I do need the substitution.
  2. I made use of a hash (as suggested by @ikegami).
  3. I used LOAD DATA LOCAL INFILE (as suggested by @ikegami, @ysth and @ThisSuitIsBlackNot), but embedded in my code: the script writes entries to a temporary file, and every time it reaches 1000 entries that file is loaded into the database (a sketch of this step follows the list).
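
For illustration, a minimal sketch of how that batched load can be done through DBI, assuming a tab-separated temp.txt and the dbname table from the snippet above; the mysql_local_infile=1 option in the DSN (which DBD::mysql needs for the LOCAL variant) and the flush_batch helper are illustrative, not my exact code:

    use strict;
    use warnings;
    use DBI;

    # Assumed connection; mysql_local_infile=1 lets DBD::mysql run LOAD DATA LOCAL.
    my $dbh = DBI->connect('dbi:mysql:dbname;mysql_local_infile=1',
                           'user', 'password', { RaiseError => 1 });

    # Load one tab-separated batch file into the table, then remove it.
    sub flush_batch {
        my ($file) = @_;
        $dbh->do(qq{
            LOAD DATA LOCAL INFILE '$file'
            INTO TABLE dbname
            FIELDS TERMINATED BY '\\t'
            LINES TERMINATED BY '\\n'
            (id, type_entry, entry)
        });
        unlink $file or warn "Could not unlink $file: $!";
    }

The parsing loop calls flush_batch('temp.txt') each time 1000 entries have been written; the placeholder comment in the EDIT 3 code below marks that spot.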

The timings for consecutive chunks of 1,000,000 lines are:

13 s, 11 s, 24 s, 22 s, 35 s, 34 s, 47 s, 45 s, 58 s, 57 s, ...

(I wanted to post an image of these timings, but I don't have enough reputation.)

EDIT 2:

I checked the timings again and tracked the time the script spends writing to the database; to my surprise, that part is linear. What I am concluding from this is that there is some issue with the while loop, which I believe increases the time exponentially, as it has to go to the line number for every iteration, and the deeper it gets into the file, the more lines it has to count to reach the next line.

Any comments on that?

EDIT 3:

$start_time = time();
$line = 0;
%values = ();    # accumulates id/type/pattern across consecutive lines

open my $fh1, '<', "file.xml" or die $!;
while (<$fh1>)
{
    $line++;
    if ($_ =~ s/foo//gi)
    {
        $values{'id'} = $_;
    }
    elsif ($_ =~ s/foo//gi)
    {
        $values{'type'} = $_;
    }
    elsif ($_ =~ s/foo//gi)
    {
        $values{'pattern'} = $_;
    }
    if (keys(%values) == 3)
    {
        $no_lines++;
        open FILE, ">>temp.txt" or die $!;
        print FILE "$values{'id'}\t$values{'type'}\t$values{'pattern'}\n";
        close FILE;
        if ($no_lines == 1000)
        {
            # write it to database using `LOAD DATA LOCAL INFILE` and unlink the temp.txt file
        }
        undef %values;
    }
    if ($line == ($line1 + 1000000))
    {
        $line1 = $line;
        $read_time = time();
        $processing_time = $read_time - $start_time - $processing_time;
        print "xml file parsed till line $line, time taken $processing_time sec\n";
    }
}

ANSWER:

First, I would like to apologize for taking so long to reply. I started learning Perl again from the ground up, and this time wrote the script with use strict, which helped me maintain linear time. Using an XML parser is also a good idea when handling large XML files.
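
To illustrate, a minimal sketch with XML::Twig; the record, id, type and entry element names are only placeholders, since the real structure of file.xml is not shown here, and the point is simply that purging each processed element keeps memory use flat on a file this size:

    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # 'record' stands in for whatever element wraps one entry of interest
            record => sub {
                my ($t, $elt) = @_;
                my $id    = $elt->field('id');      # hypothetical child elements
                my $type  = $elt->field('type');
                my $entry = $elt->field('entry');
                # ... write "$id\t$type\t$entry" to the batch file here ...
                $t->purge;    # discard the parts of the tree already handled
            },
        },
    );
    $twig->parsefile('file.xml');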

And to add to that: there is nothing wrong with the speed of MySQL inserts; it stays linear.

Thanks to all for the help and suggestions.

– Rajinder

  • all mysql tables have a primary key; if you don't have one, mysql adds a hidden one for you. so if you have a field that would work as primary key, it's better to explicitly make it so. – ysth Jan 29 '15 at 14:17
  • show your real code for the while loop? what you have here doesn't really make sense; it looks like you will skip some initial lines (where one of the three variables is not set) but thereafter write for every single line... and if id is your primary key, the lines that change entry or type will have the insert fail because of a duplicate primary key – ysth Jan 30 '15 at 15:42

1 Answer


I'm guessing the bottleneck is the actual insertion. It will surely be a bit faster to generate the INSERT statements, place them in a file, then execute the file using the mysql command line tool.

You can experiment with creating INSERT statements that insert a large number of rows vs individual statements.
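
As a rough sketch of the multi-row form, built in Perl with placeholders (the table and column names come from the question; the batch size and the flush_rows helper are just illustrative, and the total size of one statement is bounded by max_allowed_packet):

    use strict;
    use warnings;
    use DBI;

    # Connection as in the question (credentials are placeholders).
    my $dbh = DBI->connect('dbi:mysql:dbname', 'user', 'password',
                           { RaiseError => 1 });

    my @rows;    # each element: [ $id, $type_entry, $entry ]

    # Send one multi-row INSERT for everything accumulated so far.
    sub flush_rows {
        return unless @rows;
        my $placeholders = join ', ', ('(?, ?, ?)') x @rows;
        $dbh->do(
            "INSERT INTO dbname (id, type_entry, entry) VALUES $placeholders",
            undef,
            map { @$_ } @rows,    # flatten the triples into one bind list
        );
        @rows = ();
    }

    # In the parsing loop: push @rows, [ $id, $type, $entry ];
    # then call flush_rows() every 1000 rows and once more after the loop.

Alternatively, literal INSERT statements can be written to a file and fed to the server with the command line client, e.g. mysql dbname < inserts.sql, as suggested above.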

Or maybe it's best to avoid INSERT statements entirely. I think the mysql command line tool has a facility to populate a database from a CSV file. That might yield a little more speed.

Better yet, you can use LOAD DATA INFILE if you have access to the file system of the machine hosting the database.


Your Perl code could also use some cleaning up.

  • You search for every pattern twice? Change

    if (/foo/) { s/foo//gi; $id = $_ }
    

    to

    if (s/foo//gi) { $id = $_ }
    
  • Actually, do you need a substitution at all? This might be faster

    if (/foo (.*)/) { $id = $1 }
    
  • Looks like you might be able to do something more along the lines of

    my ($k, $v) = split(/:\s*/);
    $row{$k} = $v;
    

    instead of that giant if.

  • Also, if you use a hash, then you can use the following for the last check (a combined sketch follows this list):

    if (keys(%row) == 3)
    
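Putting those pieces together, a rough sketch of what the loop could look like; this assumes the lines of interest have a key: value shape with keys named id, type_entry and entry, which is only a guess at the real format:

    use strict;
    use warnings;

    my %row;
    open my $fh, '<', 'file.xml' or die $!;
    while (my $line = <$fh>) {
        chomp $line;
        my ($k, $v) = split(/:\s*/, $line, 2);          # key/value split, as above
        next unless defined $v;
        next unless $k =~ /^(?:id|type_entry|entry)$/;  # keep only the fields of interest
        $row{$k} = $v;
        if (keys(%row) == 3) {
            # emit one record, e.g. as a tab-separated line for a later bulk load
            print join("\t", @row{qw(id type_entry entry)}), "\n";
            %row = ();
        }
    }
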
– ikegami
  • the command line tool is just a client like any other, so its inserts aren't any faster. you can use the csv import command `LOAD DATA` either through it or just in your script. – ysth Jan 29 '15 at 14:21
  • @ysth, Sure it will be. Less overhead. No scalar creation, etc – ikegami Jan 29 '15 at 14:21
  • LOAD DATA has a LOCAL option to make the client read the file. it can be a little difficult to get either flavor working (correctly setting mysql permissions/settings or security hardening on the server). Batching inserts (having each insert statement insert thousands of rows) gets you almost as much benefit. – ysth Jan 29 '15 at 14:29
  • @ysth In my experience, `LOAD DATA INFILE` is significantly faster than batched inserts after a certain threshold (in the case of the application I benchmarked for, roughly 100 rows, but this will vary for each application). I haven't used it with `LOCAL`, which would be slower, but a benchmark is really necessary. Don't forget that the maximum size of batched inserts is limited by `max_allowed_packet`, which only goes up to 1 GB. – ThisSuitIsBlackNot Jan 29 '15 at 14:51
  • @ThisSuitIsBlackNot: yes. but either will be much much faster than individual inserts; I suspect that just batching the inserts will be fast enough in this case. – ysth Jan 29 '15 at 16:00
  • Thanks @ysth I added back the primary key – Rajinder Jan 30 '15 at 08:45