
I'm trying to build a service that will collect some data from the web at certain intervals, parse that data, and finally, depending on the result of the parse, execute dedicated procedures. A typical run of the service looks like this:

  1. Request the list of items to be updated
  2. Download data for the listed items
  3. Check what hasn't been updated yet
  4. Update the database
  5. Filter the data that contains updates (get only the highest-priority updates)
  6. Perform some procedures to parse those updates
  7. Filter the data that contains updates (get only the medium-priority updates)
  8. Perform some procedures to parse ... ... ...

Everything would be simple if there weren't so much data to update. There is so much that at every step from 1 to 8 (except maybe 1) the scripts will fail due to the 60-second max execution time limit. Even if there were an option to increase it, that would not be optimal, as the primary goal of the project is to deliver the highest-priority data first. Unfortunately, determining the priority level of a piece of information requires fetching the majority of all the data and doing a lot of comparisons between the already stored data and the incoming (update) data.

I could sacrifice some of the service's speed to get at least the high-priority updates, and wait longer for all the others. I thought about writing a parent script (a manager) to control every step (1-8) of the service, perhaps by executing other scripts. The manager should be able to resume an unfinished step (script) to get it completed. It is possible to write every step so that it does a small portion of the work and, after finishing, marks that portion as done in e.g. an SQL DB. After the manager resumes it, the step (script) will continue from the point where it was terminated by the server for exceeding the max execution time.
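To illustrate the resumable-step idea, here is a minimal sketch. The `work_items` table, its `done` flag, and the 50-second budget are illustrative assumptions, not an existing schema; an in-memory SQLite database is used so the sketch is self-contained, but the real service would point PDO at its own database.

```php
<?php
// Sketch of one resumable step: process small portions of work, mark each
// portion done in the DB, and stop before hitting the 60 s hard limit.
// Table and column names are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->exec("CREATE TABLE work_items (id INTEGER PRIMARY KEY, payload TEXT, done INTEGER DEFAULT 0)");
foreach (['a', 'b', 'c'] as $p) {
    $pdo->exec("INSERT INTO work_items (payload) VALUES ('$p')");
}

$start  = time();
$budget = 50; // stop well before the 60 s limit so the last portion can finish

while (time() - $start < $budget) {
    // Fetch the next small portion of unfinished work.
    $row = $pdo->query("SELECT id, payload FROM work_items WHERE done = 0 LIMIT 1")
               ->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        break; // nothing left: this step is complete
    }

    // ... do the actual work for this portion here ...

    // Mark the portion as done so a resumed run skips it.
    $stmt = $pdo->prepare("UPDATE work_items SET done = 1 WHERE id = ?");
    $stmt->execute([$row['id']]);
}

$remaining = (int)$pdo->query("SELECT COUNT(*) FROM work_items WHERE done = 0")->fetchColumn();
```

When the cron-driven manager re-runs this script a minute later, the `WHERE done = 0` query naturally makes it continue from where the previous run was cut off.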

Known platform restrictions: remote (shared) server, unchangeable max execution time, usually a limit of one script running at a time, lack of access to many Apache features, and all the other restrictions typical of remote servers.

Requirements: some kind of manager is mandatory, as besides calling particular scripts this parent process must write some notes about the scripts that were activated.

The manager can be called by curl; a one-minute interval is enough. Unfortunately, giving curl a list of calls to every step of the service is not an option here.

I also considered getting a new remote host for every step of the service and controlling them from yet another remote host that would call them and ask them to do their job, e.g. via SOAP. But this scenario is at the bottom of my list of preferred solutions, because it does not solve the max execution time problem and it brings a lot of data exchange over the public internet, which is the slowest way to work on data.

Any thoughts on how to implement a solution?

Jimmix

1 Answer


I don't see how steps 2 and 3 by themselves could take over 60 seconds. If you use curl_multi_exec for step 2, it will run in seconds. And if your script really spent over 60 seconds at step 3, you would more likely hit "memory limit exceeded" instead, and a lot earlier.
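A minimal sketch of step 2 with `curl_multi_exec`, downloading all listed items in parallel instead of one by one. The URL list is an assumption; here it points at a local temporary file via `file://` so the sketch runs without a network, but in the real service it would hold the items' HTTP URLs.

```php
<?php
// Parallel download of an item list using PHP's curl_multi API.
// The file:// URL stands in for real item URLs (illustrative only).
$tmp = tempnam(sys_get_temp_dir(), 'item');
file_put_contents($tmp, 'item-data');
$urls = ['file://' . $tmp];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // collect body instead of printing it
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers concurrently until every handle has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running > 0 && $status === CURLM_OK);

$results = [];
foreach ($handles as $ch) {
    $results[] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
unlink($tmp);
```

Because the transfers overlap, total wall-clock time is roughly that of the slowest single download rather than the sum of all of them.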

All that leads me to the conclusion that the script is very unoptimized. The solution would be to:

  1. Break the task in two: (a) decide what to update and save that in the database (say, flag 1 for rows that need an update, 0 for those that don't); (b) cycle through the rows that need an update and update them, setting the flag back to 0. At ~50 seconds, just shut down (assuming the script is run every few minutes, that will work).

  2. Get a second server and set it up with a proper execution time so it can run your script for hours. Since it will have direct access to your first database (and not via HTTP calls), it won't be a major traffic increase.
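Phase (a) of point 1 can be sketched as a single pass that compares stored state against what the remote source reports and flags the stale rows. The `items` table, `version` column, and the `$remote` array are hypothetical; an in-memory SQLite database keeps the sketch self-contained.

```php
<?php
// Phase (a): mark which rows need updating, so phase (b) can work through
// the flagged rows in short, resumable runs. Schema is illustrative.
$pdo = new PDO('sqlite::memory:');
$pdo->exec("CREATE TABLE items (id INTEGER PRIMARY KEY, version INTEGER, needs_update INTEGER DEFAULT 0)");
$pdo->exec("INSERT INTO items (id, version) VALUES (1, 3), (2, 5), (3, 1)");

// Versions reported by the remote source; in the real service this would
// come from the downloaded item list of step 1.
$remote = [1 => 3, 2 => 6, 3 => 2];

// Flag only the rows whose stored version is behind the remote one.
$stmt = $pdo->prepare("UPDATE items SET needs_update = 1 WHERE id = ? AND version < ?");
foreach ($remote as $id => $remoteVersion) {
    $stmt->execute([$id, $remoteVersion]);
}

$flagged = $pdo->query("SELECT id FROM items WHERE needs_update = 1 ORDER BY id")
               ->fetchAll(PDO::FETCH_COLUMN);
```

This pass is cheap (one indexed comparison per item), so it comfortably fits in one run; only phase (b), which does the heavy per-row updates, needs the ~50-second cutoff.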

Ranty