
I have a large table of orders. At a set time, I need to send an SMS message (different content for each one, unfortunately) to a large chunk of them (based on whether they've opted in to the message and included a phone number). It could be 200,000+ numbers in a fairly short time frame. (It's not really that high now, but it theoretically could be, and I want to build as such.)

They don't all have to send instantly, but within 1-3 hours of the set time would be ideal.

I'm using a service with an API to send them, so that's not the problem - the problems are:

1) how to handle the large quantity

2) how to know they all got processed or re-process the ones that didn't

I don't think it would be a good idea to do a MySQL query to get all 200,000+ phone numbers and then loop through - I'd have to assume that would suck up a lot of memory(?).

So - I thought to try a cron job and have it run every minute (or so). In that script, I could pull maybe 5,000 records, mark them as "in-process", and loop through them, processing each one and then updating its row to "sent".

But that has so many potential issues... what if the SMS service slows down and I can't post them all? What if there's an unforeseen error and the script stops partway through... etc. If any number of things causes that script to not send all 5,000 records, how would I know which ones to go back and re-process?

It's not only THIS process, but a few others we have - the overall question keeps coming up of how to handle a large number of rows that we need to process in some way and know that each one got done.

I hope I'm just over-complicating the crap out of this, and that there's some simpler way to handle it.

If I'm not clear, please comment and I'd be glad to explain further about any aspect.

Dave
  • *I hope I'm just over-complicating the crap out of this*... You are. – Evan de la Cruz
  • Wow, didn't expect my answer to get accepted... thanks! I'm kinda new here, still learning etiquette and such. – Evan de la Cruz Aug 24 '14 at 06:19

1 Answer


The short version:

  1. Don't worry about memory consumption. Just don't try to fetch the entire result set at once.

  2. Your idea of using a separate table to list every text message and then updating the row when you know if it succeeded or not is generally the correct approach (whether you do it in cron or not doesn't really matter).

  3. If you are concerned that your SMS provider might drop some of your requests, then you can implement your own queuing mechanism using ActiveMQ or something similar. However, that kind of defeats a large part of the purpose of using a provider: they should be running their own queue so that you don't need to worry about it.

Details:

The SMS service should be notifying you of success or failure. Most high-volume SMS services queue your messages and send them out in chunks of n messages at a time. Then they will notify you, via some sort of callback or webhook, which messages have succeeded and which have failed. Most of them also provide APIs that let you check whether a given message has been sent or not. You need to leverage those features.

I think you're on the right track with your cron approach.

One option is to never "pull" the records. Instead, have a column on the existing table that specifies whether the row is waiting for a message to be sent. This way, instead of doing a SELECT and dealing with hundreds of thousands of rows, you do a simple UPDATE, and then, as each callback comes in from the API, you re-update the rows with success/failure.
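For example, something along these lines (just a sketch; the Orders table and its column names are made up for illustration):

$mysqli = new mysqli("host", "user", "pass", "db");

// Flag every eligible row in one statement; no rows come back into PHP.
// Table/column names (Orders, OptedIn, Phone, SmsStatus) are hypothetical.
$mysqli->query(
    "UPDATE Orders
        SET SmsStatus = 'pending'
      WHERE OptedIn = 1
        AND Phone IS NOT NULL
        AND SmsStatus IS NULL"
);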

If you are in a situation where you might be sending multiple messages at a time for each row of data, then obviously this won't work. You will need a separate table with a row for every message that you want to track.
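In that case, the tracking table could look something like this (a sketch only; adjust names and types to your schema):

CREATE TABLE TextMessages (
    Id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    OrderId      INT UNSIGNED NOT NULL,
    Phone        VARCHAR(20)  NOT NULL,
    Body         TEXT         NOT NULL,
    SomeSmsToken VARCHAR(64)  NULL,  -- provider's message id, set once handed off
    Status       ENUM('pending','sent','failed') NOT NULL DEFAULT 'pending',
    SentAt       DATETIME     NULL
);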

As far as your memory concerns, I don't think it is an issue. Just don't fetch the entire result set; instead, fetch each row individually. This prevents MySQL from returning the entire data set at once, and thus you won't need to keep it all in memory.

From php.net:

As mysqli_fetch_all() returns all the rows as an array in a single step, it may consume more memory than some similar functions such as mysqli_fetch_array(), which only returns one row at a time from the result set. Further, if you need to iterate over the result set, you will need a looping construct that will further impact performance. For these reasons mysqli_fetch_all() should only be used in those situations where the fetched result set will be sent to another layer for processing.

Edit / amendment

To address the comment/question:

I can't pull just one entry per cron - that would take forever... I understand I shouldn't fetch the entire result set at once too, that's what led me to ask "then how else can I do it?"

In PHP (using mysqli with mysqlnd), a query doesn't have to pull the whole result set into PHP's memory. By default query() buffers the full result on the client, but if you pass MYSQLI_USE_RESULT the rows stay on the server and are handed to you one at a time as you fetch them.

When you use fetch_all you are asking for the entire result at once. When you use fetch_array, you are asking only for the next row, and the result cursor advances so you can get the row after that. As long as you don't store every single row in memory (in separate variables or an array), there is no memory issue. Just use the row as you need to use it, and then get the next one. Whether it's a cron job or not doesn't matter.
You do not need to call the script over and over, once for each row. The script handles every row in a single invocation; it just reads one row at a time, so as to save memory.

Here is a script example:

$mysqli = new mysqli("host", "user", "pass", "db");
$query = "SELECT * FROM TextMessages";

// MYSQLI_USE_RESULT streams rows from the server instead of buffering the
// whole result set in PHP's memory.
$result = $mysqli->query($query, MYSQLI_USE_RESULT);

while ($row = $result->fetch_array(MYSQLI_ASSOC))
{
    //this is the only thing you store in memory, one single row at a time

    //go send the text message and do whatever else you need to do,
    //skipping rows that were already handed to the provider
    if ($row["SomeSmsToken"] === null && !$row["TextHasAlreadyBeenSentOrDateSentOrWhatever"])
    {
        //$someSmsToken = $myTwilioObject->SendByRow($row);
        //$this->UpdateRowToTellItThatItHasBeenSentToProviderAndIsWaitingForResponse($row, $someSmsToken);
        //..etc...
        //note: while an unbuffered result is open you can't run other queries
        //on this same connection, so do the status UPDATE over a second connection.
        //then go to the next row.
    }
}
$result->free();

And then in some callback script you would do something like this:

$mysqli = new mysqli("host", "user", "pass", "db");

// Use a prepared statement so the provider-supplied token can't inject SQL.
$token = $_POST["SomeTokenSentFromProviderInCallback"];
$stmt = $mysqli->prepare("SELECT * FROM TextMessages WHERE SomeSmsToken = ?");
$stmt->bind_param("s", $token);
$stmt->execute();
$result = $stmt->get_result(); // requires mysqlnd

while ($row = $result->fetch_array(MYSQLI_ASSOC))
{
  $someObject->UpdateRowToSayThatTheTextWasSentOrItFailed($row, $_POST["SomeStatusSentFromProviderInCallback"]);
}

$result->free();

You can also use mysqli_free_result when you're done, to free up any memory consumed by PHP's MySQL driver.

From php.net:

You should always free your result with mysqli_free_result(), when your result object is not needed anymore.

EDIT: If you want some sort of smart way to deal with "what if the script times out", I would suggest having a cron run every minute. When it runs, it should check whether a previous run is still going, and only start if one isn't. The script will do its work until it times out.
Then, within a minute, cron will start it again and, since nothing is running anymore, it will pick up wherever it left off.
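Here is a minimal sketch of that "is a previous run still going?" check, using an exclusive file lock (the lock file path is arbitrary; any writable location works):

// Top of the cron script: bail out if a previous run still holds the lock.
$lockFile = fopen("/tmp/send_sms.lock", "c");
if ($lockFile === false || !flock($lockFile, LOCK_EX | LOCK_NB)) {
    // A previous run is still working; let it finish and try again next minute.
    exit(0);
}

// ... run the one-row-at-a-time sending loop from above ...

// The lock is released automatically when the process exits,
// but being explicit doesn't hurt.
flock($lockFile, LOCK_UN);
fclose($lockFile);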

Evan de la Cruz
  • There's no issue w/ the SMS provider not sending, I was just giving an example of if I can't connect to them for whatever reason. I don't understand the rest of your answer. Without pulling (i.e. a MySQL SELECT) the rows, how would I know which row(s) to send and then update? Not sending multiple per row - just 1. I appreciate the answer, and maybe there's something to it, but I do not follow at all. I can't pull just one entry per cron - that would take forever... I understand I shouldn't fetch the entire result set at once too, that's what led me to ask "then how else can I do it?" – Dave Aug 24 '14 at 03:34
  • I've re-read this about 10 times now, and although it's well-worded, **I don't think it addresses the issues in the question**. Please clarify or let me know if I'm just not understanding what you mean. Thanks. – Dave Aug 24 '14 at 05:00
  • I think the question is very vague, and so the answer was perhaps as well. If I understand, you asked: I'm using a service with an API to send them, so that's not the problem - the problems are: 1) how to handle the large quantity 2) how to know they all got processed or re-process the ones that didn't. I will update my answer to answer the more specific question that you are now asking. – Evan de la Cruz Aug 24 '14 at 05:13
  • I have edited/amended my answer to provide details on how to get the data without using a bunch of memory. – Evan de la Cruz Aug 24 '14 at 05:24
  • Sorry for all the edits. I added some pseudo-code for saving the sms status and updating it via callback/webhook. – Evan de la Cruz Aug 24 '14 at 05:44
  • Appreciate the further clarification, but again I'm either just confused, or this doesn't really answer it. What if the script doesn't complete in one go-around? (which I assume is likely, or at least possible). If I wanted a script to just run once and do something without memory issues, I could figure it out - but in this case, I assume I'm going to have to deal with this one not running all the way through, no? Or would I set the timeout to infinite and assume it will complete? I suppose that's the big question - how to deal with a script that might need more than one run to finish – Dave Aug 24 '14 at 05:53
  • The answer is complete. Why wouldn't it complete in one go-round? It should. And if it doesn't you need to fix it. Because of php script timeouts? Increase them. You're basically now asking me what you should do if for an unknown reason the script doesn't complete. I can't answer that. – Evan de la Cruz Aug 24 '14 at 05:56
  • Actually, yes I can. If the script dies for some unknown reason, then you have to run it again. You know which rows have already been sent, so just don't resend them. I'll update my code to match.... Ok I updated the code with an if statement to make sure we don't send the same message twice. – Evan de la Cruz Aug 24 '14 at 05:57
  • A script should never "need more than once to finish". If it times out, then run it again. This time there will be less to do because some of it was already done last time. – Evan de la Cruz Aug 24 '14 at 06:02
  • 'A script should never "need more than once to finish". If it times out, then run it again.' - can you see how this statement is contradictory? How would I know that it timed out? How would I know that it needs to run again? And if I'm running a cron, and I'm running it every X often, and the first one takes 20 minutes to run, it may overlap - right? I'm sure it's just that I'm missing the point, but - just telling me "it's okay, just run it again" doesn't solve anything that I didn't already know. – Dave Aug 24 '14 at 06:04
  • A script should never "need more than once to finish". *If*, for some odd reason, it does happen to time out, then run it again. This time there will be less to do because some of it was already done last time. – Evan de la Cruz Aug 24 '14 at 06:10
  • How will you know that it timed out? It is irrelevant. All you need to know is that there are still some rows left that have not been sent yet. So you send them. To avoid overlap you can use any number of techniques. Check to see if the process is already running. Set a heartbeat flag in shared memory or in your database. If it gets more than X old without a heartbeat, then you know it isn't running. Please read my answer again (with the edits). If you are now asking how to prevent cron jobs from overlapping, then I think you are asking yet another question. – Evan de la Cruz Aug 24 '14 at 06:12
  • If you really want a heavy-duty solution to prevent overlap, then you can use application level locking or something like http://en.wikipedia.org/wiki/Record_locking but I don't think it is necessary if there is only one person/machine running the script. – Evan de la Cruz Aug 24 '14 at 06:18