
I have two tables. In Table 1, I have around 400K rows where each row includes a paragraph of text that can be up to 50 sentences long. In Table 2 I have a lexicon of 80K words, each with a score that I need in order to code every word of each paragraph.

The whole point of my PHP script is to explode each paragraph of text into as many words as needed, look up the score of each word in the lexicon, and end up calculating the total score of all the words for each row.

My strategy so far was to have a script that did the following:

  1. Connect to the database, Table 1
  2. While Loop, one row after another
  3. For the current row, explode the paragraph.
  4. For each word, look up into Table 2 if the word exists and return the score.
  5. Ending up with a total score for the current row.
  6. Updating Table 1 with the total score for the current paragraph.
  7. Going back to point number 2.

My code works but is not efficient. The problem is that the script is so slow that letting it run for an hour only processes the first 500 rows. That's a problem because I have 400K rows, and I will need this script for other projects.

What would you advise me to do to make this process less intense?

<?php

// Include functions ($conn is assumed to be created in functions.php)
include "functions.php";

ini_set('max_execution_time', 9000000);
echo 'Time Limit = ' . ini_get('max_execution_time');

$db = 'senate';

// Function to search the lexicon array for a word
function searchForId($id, $array) {
    foreach ($array as $key2 => $val) {
        if ($val['word'] === $id) {
            return $key2;
        }
    }
    return null;
}

// Tags to remove
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}');
$x = 1;

// Connecting to the database
if (!$conn) {
    die('Not connected : ' . mysql_error());
}

// Choose the current db
mysql_select_db($db);

// Slurp the lexicon into an array
$sql = "SELECT word, score FROM concreteness";
$resultconcreteness = mysql_query($sql) or die(mysql_error());
$array = array();
while ($row = mysql_fetch_assoc($resultconcreteness)) {
    $array[] = $row;
}

// Main loop: one query per row
while ($x <= 500000) {
    $data = mysql_query("SELECT `key`, `tagged` FROM speechesLCMcoded WHERE `key`='$x'") or die(mysql_error());

    // Put the "data" info into the $info array
    $info = mysql_fetch_array($data);
    $tagged = $info['tagged'];
    $weight = 0;
    $count = 0;

    // Print out the contents of the entry
    print "<b>Key:</b> " . $info['key'] . " <br>";

    // Explode the paragraph into words
    $speech = explode(" ", $tagged);

    // Loop over every word
    foreach ($speech as $word) {
        // Check if the string contains one of our tags
        if (preg_match('/({V}|{J}|{N}|{RB})/', $word, $matches)) {
            // Remove our tags
            $word = str_replace($remove, "", $word);

            // Linear search of the lexicon array, once per word
            $id = searchForId($word, $array);
            $weight = $weight + $array[$id]['score'];
            $count = $count + 1;
        }
    }

    $sql = "UPDATE speechesLCMcoded SET weight='$weight', count='$count' WHERE `key`='$x';";
    $retval = mysql_query($sql, $conn);
    if (!$retval) {
        die('Could not update data: ' . mysql_error());
    }
    echo "Updated data successfully\n";
    ob_flush();
    flush();

    // Increase the loop counter by one
    $x = $x + 1;
}
?>

Here is the table structure:

CREATE TABLE `speechesLCMcoded` (
 `key` int(11) NOT NULL AUTO_INCREMENT,
 `speaker_state` varchar(100) NOT NULL,
 `speaker_first` varchar(100) NOT NULL,
 `congress` varchar(100) NOT NULL,
 `title` varchar(100) NOT NULL,
 `origin_url` varchar(100) NOT NULL,
 `number` varchar(100) NOT NULL,
 `order` varchar(100) NOT NULL,
 `volume` varchar(100) NOT NULL,
 `chamber` varchar(100) NOT NULL,
 `session` varchar(100) NOT NULL,
 `id` varchar(100) NOT NULL,
 `raw` mediumtext NOT NULL,
 `capitolwords_url` varchar(100) NOT NULL,
 `speaker_party` varchar(100) NOT NULL,
 `date` varchar(100) NOT NULL,
 `bills` varchar(100) NOT NULL,
 `bioguide_id` varchar(100) NOT NULL,
 `pages` varchar(100) NOT NULL,
 `speaker_last` varchar(100) NOT NULL,
 `speaker_raw` varchar(100) NOT NULL,
 `tagged` mediumtext NOT NULL,
 `adjectives` varchar(10) NOT NULL,
 `verbs` varchar(10) NOT NULL,
 `nouns` varchar(10) NOT NULL,
 `weight` varchar(50) NOT NULL,
 `count` varchar(50) NOT NULL,
 PRIMARY KEY (`key`)
) ENGINE=InnoDB AUTO_INCREMENT=408344 DEFAULT CHARSET=latin1
O. Jones
user1029296
    To be honest, I think MySQL is the wrong technology for doing this. Maybe take a look at something like Lucene or Elasticsearch if you want to do heavy scoring during data analysis. – t j May 09 '14 at 02:32
  • Thanks for the suggestion, but I cannot afford changing of technology now. I am really looking for a way to optimize the current system. – user1029296 May 09 '14 at 02:33
  • 2
    Slow is one thing, but an hour to process 500 rows seems really, really slow. That is about 12 seconds per paragraph. I would suggest moving the php code into a stored procedure. Doing the work for a single paragraph all inside the database should show a considerable improvement in performance. – Gordon Linoff May 09 '14 at 02:35
  • Maybe try logging your queries and then analysing them using the MySQL EXPLAIN command. That will tell you about any problems with your indexes and how to optimise them. https://dev.mysql.com/doc/refman/5.0/en/explain.html – t j May 09 '14 at 02:40
  • I added my code if it is of any help to understand what is so slow. – user1029296 May 09 '14 at 05:40
  • Please show the structure of `speechesLCMcoded` including indexes. – O. Jones May 09 '14 at 12:37
  • speechesLCMcoded has a "key" column, with an autoincrement, several fields related to who wrote the paragraph in that row, a column named "raw" that includes the text without tags, a column named "tagged" that includes tagged text, and two columns "weight" and "count" that include the cumulated score and number of scored words by the script. I do not have an index as I do not know how/why to have them. – user1029296 May 09 '14 at 13:27
  • Please edit your question to show the output of `SHOW CREATE TABLE speechesLCMcoded` . – O. Jones May 09 '14 at 13:45
  • Sure, I just posted it. Thanks! – user1029296 May 09 '14 at 14:01
  • Just to get it straight, what you're doing is extract words from paragraphs, compare them to lexicon for score and then update the speechesLCMcoded with weight/count? Would you be willing to completely alter your structure for a really quick algorithm? – N.B. May 09 '14 at 14:26
  • That is exactly what I am doing. I am not really looking to alter the structure if it takes a lot of effort because I just want the script to run once. If it takes 5 hours I am fine. I just can't have it run 45 days... – user1029296 May 09 '14 at 14:31
  • I'd add another table. It'd contain words from the record that contains your paragraph. Let's call it paragraph2words. I'd let mysql compare each word from that table to lexicon. That'd be really quick and I'd move that to procedure. Procedure would update your initial table with two numbers in question. PHP would have no job here whatsoever. 400k records is nothing, comparing it against 80k should be also quick. This is just a skeleton of what'd happen, but I think you might get the general idea. – N.B. May 09 '14 at 14:43
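The set-based approach sketched in the last comment could look something like the SQL below. The `paragraph2words` table, its columns, and the step that populates it are hypothetical; only `speechesLCMcoded` and `concreteness` come from the question:

```sql
-- Hypothetical helper table: one row per (paragraph key, word),
-- populated by exploding each `tagged` paragraph once.
CREATE TABLE paragraph2words (
  `key` int NOT NULL,
  `word` varchar(100) NOT NULL,
  KEY (`key`),
  KEY (`word`)
);

-- One set-based statement then scores every paragraph at once,
-- with no per-row round trips from PHP:
UPDATE speechesLCMcoded s
JOIN (
  SELECT p.`key`, SUM(c.score) AS weight, COUNT(*) AS cnt
  FROM paragraph2words p
  JOIN concreteness c ON c.word = p.word
  GROUP BY p.`key`
) t ON t.`key` = s.`key`
SET s.weight = t.weight, s.count = t.cnt;
```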

2 Answers


You have a fairly small reference table (your lexicon) and an enormous corpus of text (table 1).

If I were you I would start your program by slurping the entire lexicon from the table into a PHP array in memory. Even if all your words are 20 characters in length this will only take a dozen or so megabytes of RAM.

Then do your step 4 by looking up each word in memory rather than with a SQL query. Your inner loop (for each word) will be much faster, and just as accurate.

Be careful about one thing, though. You'll need to normalize the words in your lexicon by converting them to lower case if you are to replicate the case-insensitive lookup behavior of MySQL.

Edit after seeing your code

Some pro tips:

  • Indent your code properly so you can see the structure of your loops at a glance.
  • Remember that passing data to functions takes time.
  • PHP arrays are associative. You can do $value = $array[$key]. This is fast; you don't have to search an array linearly. Yet you're doing that linear search once per word!
  • Prepared statements are good.
  • Repeating an SQL statement when you could read the next row from its result set is bad.
  • Streaming result sets is good.
  • The mysql_ set of function calls is deprecated, and despised by its developers and everybody else, for good reasons.
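To make the associative-array tip concrete, here is a minimal sketch contrasting the two lookups (the sample words and scores are invented):

```php
<?php
// Sample lexicon rows, invented for illustration.
$rows = array(
    array('word' => 'Apple', 'score' => 3.2),
    array('word' => 'run',   'score' => 2.1),
);

// Linear scan, like searchForId() in the question: O(n) for every single word.
function searchForId($id, $array) {
    foreach ($array as $key => $val) {
        if ($val['word'] === $id) {
            return $key;
        }
    }
    return null;
}

// Hash lookup: build the associative array once, then each lookup is O(1).
$lookup = array();
foreach ($rows as $row) {
    $lookup[strtolower($row['word'])] = $row['score'];
}

echo searchForId('run', $rows), "\n"; // 1  (index found by the linear scan)
echo $lookup['run'], "\n";            // 2.1 (score straight from the hash)
```

With 80K lexicon entries and dozens of words per paragraph, replacing the scan with the hash lookup removes tens of thousands of comparisons per word.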

There's way too much going on in your loops.

What you need is this:

First of all, switch to using mysqli_ from using mysql_ interfaces. Just do it. mysql_ is too slow, old, and crufty.

$db = new mysqli("host", "user", "password", "database");

Second, change the way you are loading your lexicon, to optimize the whole associative-array dealio.

$lookup = array();
//Slurps the lexicon into an array, streaming it row by row
$sql = "SELECT word, score FROM concreteness";
$db->real_query($sql) or die($db->error);
$lkup = $db->use_result();
while ($row = $lkup->fetch_row()) {
      $lookup[strtolower($row[0])] = $row[1];
}
$lkup->close();

This gives you an associative array called $lookup. If you have a $word, you can find its weight value this way. This is fast. What you have in your example code is very slow. Notice that the keys are all converted to lower case both when they are created and when words are looked up. Don't put this in a function if you can avoid it, for performance reasons.

if (array_key_exists( strtolower($word), $lookup )) {
    $weight += $lookup[strtolower($word)]; /* accumulate weight */
    $count ++;                             /* increment count   */
}
else {
  /* the word was not found in your lexicon. handle as needed */
}

Finally, you need to optimize your querying of the rows of your text corpus, and its updating. I believe you should do that using prepared statements.

Here's how that will go.

Near the beginning of your program, place this code.

$previouskey = -1;
if (/* you aren't starting at the beginning */) {
   $previouskey = /* the last successfully processed row */
}

$get_stmt = $db->prepare('SELECT `key`, `tagged` 
                           FROM speechesLCMcoded 
                          WHERE `key` > ?
                          ORDER BY `key` LIMIT 1' );

$post_stmt = $db->prepare ('UPDATE speechesLCMcoded 
                               SET weight=?, 
                                   count=? 
                             WHERE `key`=?' );

These give you two ready-to-use statements for your processing.

Notice that the $get_stmt retrieves the first key you haven't yet processed. This will work even if you have some missing keys. Always good. This will be decently efficient because you have an index on your key column.

So here's what your loop ends up looking like:

 $weight = 0;
 $count = 0;
 $key = 0;
 $tagged = '';

 /* bind parameters and results to the get statement */
 $get_stmt->bind_result($key, $tagged);
 $get_stmt->bind_param('i', $previouskey);

 /* bind parameters to the post statement */
 $post_stmt->bind_param('iii',$weight, $count, $key);

 $done = false;
 while ( !$done ) {
    $get_stmt->execute();
    if ($get_stmt->fetch()) {

        /* do everything word - by - word  here on the $tagged string */

        /* do the post statement to store the results */
        $post_stmt->execute();

        /* update the previous key prior to next iteration */
        $previouskey = $key; 
        $get_stmt->reset();
        $post_stmt->reset();
    } /* end if fetch */
    else {
       /* no result returned! we are done! */
       $done = true;
    }
 } /* end while not done */

This should get you down to subsecond processing per row.
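The word-by-word placeholder in the loop above can be filled in with the associative lookup described earlier. A sketch, where the `$lookup` map, the `$remove` tag list, and the sample paragraph are stand-ins (the first two come from the code earlier in this answer and from the question; the sample values are invented):

```php
<?php
// Assumed inputs, invented for the sketch: the lexicon map, the tag list,
// and one tagged paragraph.
$lookup = array('run' => 2.1, 'fast' => 4.0);
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}');
$tagged = 'they {V}run{/V} very {J}fast{/J} indeed';

// Per-row scoring: only tagged words count, and words missing from the
// lexicon are skipped instead of silently adding null to the weight.
$weight = 0;
$count  = 0;
foreach (explode(' ', $tagged) as $word) {
    if (preg_match('/{(V|J|N|RB)}/', $word)) {
        $clean = strtolower(str_replace($remove, '', $word));
        if (array_key_exists($clean, $lookup)) {
            $weight += $lookup[$clean];
            $count++;
        }
    }
}
echo $weight, ' / ', $count, "\n"; // 6.1 / 2
```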

O. Jones
  • Thanks for your answer. I tried that and it basically takes 10 seconds to go from one row to another, which means around 45 days of processing... – user1029296 May 09 '14 at 05:28
  • I added my code, if it is of any help. I know it's messy... I am new to PHP. – user1029296 May 09 '14 at 05:40
  • 2
    The statement about the copy isn't quite right, PHP passes the pointer and only creates a copy if you alter the variable within the scope of your function. This isn't true if you pass an object and of course not if the function gets a reference. If you aren't changing the variable in your function, no copy is created. Passing by reference is even slower than not passing by reference, even if PHP has to create a copy at some point. – Fleshgrinder May 09 '14 at 14:48
  • Entire algorithm in question can be done solely with MySQL and PHP wouldn't have to do anything except enter new paragraphs. Not downvoting it since it does answer the question to some point, but there are far better alternatives. – N.B. May 09 '14 at 14:51
  • This is very helpful. I will need time to understand exactly how it works and to adapt it, but it looks like it's what I need. Thanks a lot. – user1029296 May 09 '14 at 15:19
  • @Fleshgrinder that's a good point. But the linear search of the lookup table is still suboptimal. The original questioner will probably get most of her performance improvement from using an associative array. – O. Jones May 09 '14 at 15:28
  • @N.B., why not write an answer explaining how to do that? I suppose you're thinking of using some stored procedures, but it's hard to tell from your comment. – O. Jones May 09 '14 at 15:30
  • Will do once I catch some time. – N.B. May 09 '14 at 15:30
  • 1
    My comment was only regarding the fact of copy/reference/etc. I wasn't saying that your answer is wrong in any way. – Fleshgrinder May 09 '14 at 16:15

First and obvious optimization is like this:

include "functions.php";
set_time_limit(0); // NOTE: no time limit
if (!$conn)
    die('Not connected : ' . mysql_error());
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}'); // tags to remove       
$db = 'senate';
mysql_select_db($db);

$resultconcreteness = mysql_query('SELECT `word`, `score` FROM `concreteness`') or die(mysql_error());
$array = array(); // NOTE: init score cache
while($row = mysql_fetch_assoc($resultconcreteness))
    $array[strtolower($row['word'])] = $row['score']; // NOTE: php array as hashmap
mysql_free_result($resultconcreteness);

$data = mysql_query('SELECT `key`, `tagged` FROM `speechesLCMcoded`') or die(mysql_error()); // NOTE: single query instead of multiple
while ($row = mysql_fetch_assoc($data)) {
    $key = $row['key'];
    $tagged = $row['tagged'];
    $weight = $count = 0;
    $speech = explode(' ', $tagged);
    foreach ($speech as $word) {
        if (preg_match('/({V}|{J}|{N}|{RB})/', $word, $matches)) {
            $clean = strtolower(str_replace($remove, '', $word)); // NOTE: normalize once
            if (isset($array[$clean])) { // NOTE: skip words missing from the lexicon
                $weight += $array[$clean]; // NOTE: quick access to word's score
                $count++;
            }
        }
    }
    mysql_query('UPDATE `speechesLCMcoded` SET `weight`='.$weight.', `count`='.$count.' WHERE `key`='.$key, $conn) or die(mysql_error());
}
mysql_free_result($data);

Check the comments with NOTE:

But for 400K rows it will still take some time, if only because you have to update each row: that means 400K individual updates.

Possible future optimizations:

  1. Make this script accept arguments such as a start offset and a length (pass them to MySQL's LIMIT), so you can run several copies of the script at the same time, each processing a different block of the table
  2. Instead of updates, generate a file with the data, then use LOAD DATA INFILE to replace your table; it could be faster than 400K updates
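The first suggestion might start like this (a sketch; the argument handling, defaults, and script name are invented):

```php
<?php
// Sketch: chunk the corpus via CLI arguments so several copies of the
// script can run in parallel. Defaults are invented for illustration.
$offset = isset($argv[1]) ? (int)$argv[1] : 0;
$length = isset($argv[2]) ? (int)$argv[2] : 100000;

// The chunked version of the corpus query from the answer above.
$sql = sprintf(
    'SELECT `key`, `tagged` FROM `speechesLCMcoded` LIMIT %d, %d',
    $offset,
    $length
);
echo $sql, "\n";
```

You would then launch e.g. `php score.php 0 100000` and `php score.php 100000 100000` side by side, each worker handling its own block.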
Iłya Bursov