I am populating a table in MySQL from an XML file (containing more than a billion lines), using a Perl script to find the lines of interest. The script runs very smoothly up to about line 15M, but after that the time starts increasing, somewhat exponentially. For the first 1,000,000 lines it took ~12 s to parse and write them to the database, but after 15M lines the time required to parse and write the same number of lines was ~43 s.
I increased innodb_buffer_pool_size from 128M to 1024M, as suggested in Eric Holmberg's answer to "Insertion speed slowdown as the table grows in mysql". The timings came down to ~7 s and ~32 s respectively, but it is still slow, as I have a huge file to process and its time requirements keep increasing.
I also removed the creation of any primary key and index, thinking they might be causing some problem (not sure though).
Below is the code snippet:
my $dbh = DBI->connect('dbi:mysql:dbname', 'user', 'password')
    or die "Connection Error: $DBI::errstr\n";

$dbh->do("DROP TABLE IF EXISTS dbname");
$dbh->do("CREATE TABLE dbname (id INTEGER NOT NULL,
                               type_entry VARCHAR(30) NOT NULL,
                               entry VARCHAR(50))")
    or die "SQL Error: $DBI::errstr\n";

my ($id, $type, $entry) = ('', '', '');

open my $fh1, '<', "file.xml" or die $!;
while (<$fh1>)
{
    if (m/some pattern/)
    {
        s/some pattern//gi;
        $id = $_;
    }
    elsif (m/some other pattern/)
    {
        s/\s|some other pattern//gi;
        $type = $_;
    }
    elsif (m/still some other pattern/)
    {
        s/still some other pattern//gi;
        $entry = $_;
    }
    if ($id ne "" && $type ne "" && $entry ne "")
    {
        $dbh->do('INSERT INTO dbname (id, type_entry, entry) VALUES (?, ?, ?)',
                 undef, $id, $type, $entry);
    }
}
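For the row-by-row approach above, two common speedups are preparing the INSERT statement once instead of calling $dbh->do per row, and committing in batches instead of autocommitting every insert. A minimal sketch of that pattern follows; it uses DBD::SQLite with an in-memory database only so it runs stand-alone, and for the question's setup the connect line would use 'dbi:mysql:dbname' instead:

```perl
use strict;
use warnings;
use DBI;

# DBD::SQLite in-memory DB so the sketch is self-contained; with MySQL
# only the connect() line changes.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do('CREATE TABLE dbname (id INTEGER NOT NULL,
                               type_entry VARCHAR(30) NOT NULL,
                               entry VARCHAR(50))');

# Prepare once, outside the loop, instead of $dbh->do() per row.
my $sth = $dbh->prepare(
    'INSERT INTO dbname (id, type_entry, entry) VALUES (?, ?, ?)');

my @parsed = ([1, 'a', 'x'], [2, 'b', 'y']);    # stand-in for parsed lines
my $batch  = 0;
for my $row (@parsed) {
    $sth->execute(@$row);
    $dbh->commit if ++$batch % 10_000 == 0;     # one commit per 10k rows
}
$dbh->commit;                                   # flush the final partial batch

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM dbname');
print "$count rows inserted\n";
```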
The database would contain around 1.7 million entries. What more can be done to reduce the time?
Thanks in advance.
EDIT 1:
Thank you all for the help. Since morning I have been trying to implement all that was suggested, checking whether I get any significant reduction in time. So what I did:
- I removed matching the pattern twice, as told by @ikegami, but yes, I do need the substitution.
- I made use of a hash (as told by @ikegami).
- I used LOAD DATA LOCAL INFILE (as told by @ikegami, @ysth and @ThisSuitIsBlackNot), but embedded it into my code: the script writes entries to a temporary file, and when it reaches 1000 entries they are written to the db.
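The batching described above can be sketched without the database side: buffer rows in memory, flush every 1000 to a tab-separated file, and hand that file to LOAD DATA LOCAL INFILE. The LOAD DATA call itself is commented out here, since it needs a live MySQL connection with local_infile enabled; everything else runs as-is:

```perl
use strict;
use warnings;

my @buffer;
my $flushed = 0;

sub flush_batch {
    return unless @buffer;
    open my $tmp, '>', 'temp.txt' or die $!;
    print {$tmp} map { "$_\n" } @buffer;
    close $tmp;
    # With a live connection, load the batch and clean up:
    # $dbh->do(q{LOAD DATA LOCAL INFILE 'temp.txt' INTO TABLE dbname
    #            FIELDS TERMINATED BY '\t'});
    $flushed += @buffer;
    @buffer = ();
    unlink 'temp.txt';
}

for my $n (1 .. 2500) {                    # stand-in for parsed entries
    push @buffer, join "\t", $n, 'type', 'entry';
    flush_batch() if @buffer == 1000;
}
flush_batch();                             # final partial batch
print "$flushed rows flushed\n";
```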
The timings of the run for consecutive blocks of 1,000,000 lines are:
13 s
11 s
24 s
22 s
35 s
34 s
47 s
45 s
58 s
57 s .....
(Wanted to post the image but... reputation)
Edit 2:
I checked the timings again and tracked the time the script takes to write to the database; to my surprise, it is linear. What I am concluding from this is that there is some issue with the while loop, which I believe increases the time exponentially, as it has to go to the line number for every iteration, and as it reaches deeper into the file it has to count more lines to reach the next line.
Any comments on that?
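For what it's worth, `while (<$fh>)` reads a file strictly sequentially and never goes back to re-count lines from the start, so that belief can be tested by timing a read-only pass on its own. A minimal, self-contained sketch (the file name and sizes here are arbitrary):

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Build a synthetic file to read through.
open my $out, '>', 'lines.txt' or die $!;
print {$out} "line $_\n" for 1 .. 200_000;
close $out;

open my $in, '<', 'lines.txt' or die $!;
my $t0     = time();
my $blocks = 0;
while (<$in>) {
    next unless $. % 100_000 == 0;    # $. holds the current input line number
    $blocks++;
    printf "lines %7d: %.4f s since last block\n", $., time() - $t0;
    $t0 = time();
}
close $in;
unlink 'lines.txt';
# The per-block times stay flat: sequential reads are linear, so any
# growth has to come from somewhere else in the loop body.
```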
EDIT 3
my $start_time = time();
my $line = 0;
my ($line1, $no_lines, $processing_time) = (0, 0, 0);
my %values;    # declared before the loop so fields collected on different lines accumulate
open my $fh1, '<', "file.xml" or die $!;
while (<$fh1>)
{
$line++;
if ($_=~ s/foo//gi)
{
$values{'id'} = $_;
}
elsif ($_=~ s/foo//gi)
{
$values{'type'} = $_;
}
elsif ($_=~ s/foo//gi)
{
$values{'pattern'} = $_;
}
if (keys(%values) == 3)
{
$no_lines++;
open my $tmpfh, '>>', "temp.txt" or die $!;
print $tmpfh "$values{'id'}\t$values{'type'}\t$values{'pattern'}\n";
close $tmpfh;
if ($no_lines == 1000)
{
#write it to database using `LOAD DATA LOCAL INFILE` and unlink the temp.txt file
$no_lines = 0;    # reset the batch counter
}
undef %values;
}
if ($line == ($line1 + 1000000))
{
$line1 = $line;
my $now = time();
$processing_time = $now - $start_time;
$start_time = $now;    # reset so each block reports only its own elapsed time
print "xml file parsed till line $line, time taken $processing_time sec\n";
}
}
ANSWER:
First, I would like to apologize for taking so long to reply; I started Perl again from the ground up, and this time came clean with use strict, which helped me maintain linear time. Also, using an XML parser is a good thing to do while handling large XML files.
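A streaming parser processes the file element by element instead of loading it all at once. A minimal sketch with XML::Twig, where the element names (entry, id, type, name) are hypothetical, since the real structure of file.xml isn't shown; for the real file, parsefile('file.xml') would replace the inline sample:

```perl
use strict;
use warnings;
use XML::Twig;

my @rows;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Fires once per </entry>, so memory use stays flat.
        entry => sub {
            my ($t, $elt) = @_;
            push @rows, join "\t", $elt->field('id'),
                                   $elt->field('type'),
                                   $elt->field('name');
            $t->purge;    # discard the elements parsed so far
        },
    },
);

# For the real file: $twig->parsefile('file.xml');
$twig->parse('<entries><entry><id>1</id><type>A</type>'
           . '<name>foo</name></entry></entries>');
print "$_\n" for @rows;    # each row is ready for the LOAD DATA batch file
```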
And to add more: there is nothing wrong with the speed of MySQL inserts; it is always linear.
Thanks to all for the help and suggestions.