-1

I developed a site using Zend Framework 2. It is basically a price comparison site that integrates with many of the top affiliate networks out there. I wrote a script that checks prices from each affiliate network, and then updates my local DB with that price. Depending on which affiliate network I am contacting, I may be making an API call (Amazon or CJ.com), or I may be looking at an XML product feed (Pepperjam or LinkShare). The XML product feed would be hosted locally.

At present, there are around 3,500 sku's that I am checking with this script. The vast majority of them (95%+) are targeting an XML product feed. I would estimate that this script should probably take in the neighborhood of 10 minutes to complete. Some of the XML files I am looking at are around 8 MB in size.

I have tested this script thoroughly in my local environment and taken great lengths to make sure that there is no memory leak or something of that nature which would cause performance issues. As an example, I made sure to use data streams where possible to avoid putting the XML file in memory over and over, etc. Suffice to say, the script runs locally without issue.

This script is intended to be run as a cron job, however I do have a way to trigger it via the secure admin interface ad-hoc. Locally, this is how I initiate the script to run, and everything goes rather smoothly.

When I deploy my code to the shared hosting account, I am having all sorts of problems. In order to troubleshoot, I attached logging to various stages of this script to track when it starts, how it progresses, and when each step completes, etc. All of this is being logged to a MySQL database.

Problem #1: If I run the script ad-hoc via an HTTP request, I find that it will run for a couple minutes, and then the script starts again (so there are now two instance apparently running). Wait another couple minutes, and a third one will start, etc..... Here is an example when I triggered the script to run at 10:09pm via an HTTP request.

Screenshot of process manager

Needless to say, I DO NOT run it via an HTTP request because it only serves to get me in trouble with my web hosting provider :)

Problem #2: When the script runs on the server, triggered via a cron job, it is failing to complete. I have taken the production copy of the database and taken it locally along with the XML files, it runs fine. So it should not be a problem with bad data exposing bad code. My observation is - the script nearly runs for the exact same amount of time - before aborts, or is terminated, or whatever. The last record updated is generally timestamped around 4 minutes and 30 seconds or so (if memory serves) after the script is triggered. The SKU list is constantly changing so the record that it ends on differs, but the the time of the last update is nearly the same each time. Nothing is being logged in the error logs. I monitored server resources via SSH top command and there is nothing out of the ordinary. CPU usage is in check and memory used does not go up.

I have a shared hosting account through Bluehost. My thoughts were that perhaps it was a script max execution time issue. I extended the max execution time in the script itself and via php.ini. Made no difference.

So I guess what I am looking for is some fresh ideas of where to go next. What questions should I be asking my hosting company so they can help me get to the bottom of this. They are only somewhat helpful to say the least. Could it be some limitation on my hosting account? Triggering some sort of automatic monitor that is killing the script? What types of Apache settings could be problematic for a script of this nature? PHP.ini settings? Absolutely any input you can provide would be helpful.

And why, when triggered via HTTP, would it keep spinning up new instances? I guess I could live w/o running it manually, and only run it via a cron job, but that isn't working either. So .... interested in hearing the communities thoughts on this. Thanks!

Cody
  • 1
  • 1

1 Answers1

1

I haven't seen your script, neither did I work with your hoster, so everything below is just a guess - and a suggestion.

Given your description, I would say you're right that your script might have been killed by timeout when run from cron. I'm not sure why it keeps spawning new instances of your script when you execute it manually via an HTTP request, but it may also be related to a timeout (e.g. if they have a logic that restarts a script if it has not produced an output within a certain time, or something like that).

You can follow up with your hosting provider about running long-running (or memory-consuming) script in their environment, and they might have some FAQ or document already written that covers this topic.

Let me suggest an option for you in case if your provider is unable to help.

From what you said, I expect your script runs an SQL query to get a list of SKUs, and then slowly iterates over this list, performing some job on every item (and eventually dies for whatever reason, as we learned).

How about if you create a temporary table (or file - just any kind of persistent storage on the server) that would save the last processed record ID of the script, or NULL if the script successfully completed. That way you'll be able to make your script start with the last processed record (if the last processed record had id = 1000, add ... WHERE id > 1000 to the main query that fetches SKUs), and you won't really care if the script completed its first attempt or not (if not, it will keep processing from that very point when it was killed, on its second try).

Alternatively, to extend this approach, you can limit one invocation to the certain amount of records to process (e.g. 100 or 1000), again, saving the last processed record ID in the database or somewhere else.

The main idea is: if the script fails to process all SKUs at once, just make it restartable so that it does not lose its progress.

afenster
  • 3,468
  • 19
  • 26
  • Thank you for the reply. When it is run via a cron job, after termination, it does not restart on it's own. I did consider scheduling the job to run every 30 minutes or something like that, then ordering the SKU query as you suggested. Each SKU record has a 'LastUpdated' field that holds a time stamp. I was kind of thinking this was the last option, but the perfectionist in me wants to figure out the source issue. I will reach out to my hosting provider and see if they are at all helpful in this regard. – Cody Dec 16 '18 at 14:20