2

I have a table with 700,000 entries and I need to check each entry for 1,000,000 words and then replace the word if found from hello to #~hello~#. Words can occur multiple times in an entry and need to all be replaced. I tried this in PHP and the estimated time was something like 362 days to complete the code. I just modified the code to use a LIKE in MySQL so that I didn't have check each of the 1,000,000 words against all 700,000 entries, but the estimated time for completion is still 29 days. This seems really high.

Further complicating the matter words can be multiple words. For example if the word is hello world the program should replace with #~hello world~#.

What am I missing?

The code looks something like this:

$query = "SELECT word_id, word_name, FROM words ORDER BY char_length(word_name) DESC";
$result = mysqli_query($con, $query);
while($row = mysqli_fetch_array($result)){
  $words[$i] = new wordObj($row['word_id'], $row['word_name']);
}

Foreach($words as $word){
  $query = "SELECT id, entry FROM entries WHERE entry LIKE '%".$word."%'";
  $result = mysqli_query($con, $query);
  if ($result) {
    if ($result->num_rows != 0) {
      while($row = mysqli_fetch_array($result)){
        $entry[$i] = new meatObj($row['id'], $row['entry']);
        $i++;
      }
    }else{
      $entry = '';
    }
  }else{
    $entry ='';
  }
  foreach($entryArray as $entry){
    check entry for all words and replace
  }
}
alutz
  • 192
  • 1
  • 1
  • 16

1 Answers1

2

The simplest solution would be storing all the words that needed to be replaced in the hash table. Then on each entry, we break all the word and check against the hash table.

// HOW DOES TAKE 29 DAYS TO EXECUTE?
// Create a hash table to store all the words
$hash = array();

$query = "SELECT word_id, word_name, FROM words ORDER BY char_length(word_name) DESC";
$result = mysqli_query($con, $query);
while($row = mysqli_fetch_array($result)){
    $hash[strtolower($row['word_name'])] = true;
}



// DO SOME QUERY HERE
// .....

while($row = mysqli_fetch_array($result)) {
    $delimiter = "/([ \.,\"'!\?\-_;])/";
    $tokens = preg_split($delimiter, $row['entry'], -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));

    // replace the text
    $final = "";
    foreach($tokens as $token) {
        if (isset($hash[strtolower($token)])) {
            $final .= "#~" $token . "~#";
        } else {
            $final .= $token;
        }
    }

    // UPDATE NEW ENTRY HERE
    // .......
}
invisal
  • 11,075
  • 4
  • 33
  • 54
  • This is good and might speed up the search and replace, if it is indeed faster than the regexp. But, The bottleneck seems to be the query which takes about 2 sec for each `word` and if it takes 2,000,000 sec = 23 days just for the query... – alutz Oct 13 '15 at 04:22
  • @alutz33, isn't your solution also need to load the word? So does it take 23 days in your solution to load the word as well? – invisal Oct 13 '15 at 04:29
  • I edited the post so you can see how the `word` array is generated. I am just creating an array of objects. This process takes about 30 sec in total. – alutz Oct 13 '15 at 04:34
  • @alutz33, how does it take 29 days just to create hashing table and it takes 30 seconds for your solution while the code is almost do the same query? – invisal Oct 13 '15 at 04:41
  • Because the first query just dumps words into an array of objects and the second query is looking through 700,00 entries where each entry can be anywhere from a sentence to multiple paragraphs for a particular word. I have to do the query for each of the 1,000,000 words and each query tames about 2 sec. – alutz Oct 13 '15 at 04:51
  • @alutz33: Can we move this to chatroom? http://chat.stackoverflow.com/rooms/92097/what-is-the-fastest-way-to-do-a-regexp-find-and-replace-in-mysql-database – invisal Oct 13 '15 at 04:56
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/92098/discussion-between-invisal-and-alutz33). – invisal Oct 13 '15 at 04:59
  • @invisal's solution runs the query just once, not once per word! – Strawberry Oct 13 '15 at 07:19
  • Yes! I realized this as I went to sleep. Late night make the brain go foggy. There still remains 1 problem with the solution and this is my fault for forgetting to mention it in the problem statement. The words being searched for can be more than 1 word. For example, `hello world` might be a term to turn into `#~hello world~#`. how would you include this in your solution? – alutz Oct 13 '15 at 11:30