
We have set up an automated deployment pipeline for an Nginx server running HipHop PHP, using AWS CodeDeploy triggered by Jenkins.

We are currently facing a challenge: while certain core files are being deployed, requests to the server appear to get queued, and as the system recovers there are delays in processing them. Some of those requests time out, which is fine, and we would like to ignore them. However, each timeout registers a fatal error in the HipHop error log, and since our post-deployment sanity check looks for the word "Fatal" in the log, it catches these entries and marks the whole deployment as failed:

2017-03-08 11:25:17 - Error found -> [Wed Mar  8 11:24:31 2017] [hphp] [12082:7f238d3ff700:32:000001] []
Fatal error: entire web request took longer than 10 seconds and timed out in /var/cake_1.2.0.6311-beta/app/webroot/openx/www/delivery/postGetAd.php on line 483
We can safely ignore this particular error in our HipHop error check status; a sketch of how such a whitelist might look follows below.
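To ignore only this specific timeout while still failing on any other fatal, the sanity check could match each "Fatal" line against a whitelist of known-benign patterns before counting it as a failure. A minimal sketch of the idea (the file path handling and the pattern list are assumptions, not our actual check):

```python
import re
import sys

# Patterns for fatals we consider benign during a deployment window
# (hypothetical list; extend as needed).
IGNORABLE_PATTERNS = [
    re.compile(r"Fatal error: entire web request took longer than \d+ seconds and timed out"),
]

def check_log(path):
    """Return (real_fatals, ignored_fatals) found in the log file."""
    real, ignored = 0, 0
    with open(path, errors="replace") as f:
        for line in f:
            if "Fatal" not in line:
                continue
            if any(p.search(line) for p in IGNORABLE_PATTERNS):
                ignored += 1
            else:
                real += 1
    return real, ignored

if __name__ == "__main__":
    real, ignored = check_log(sys.argv[1])
    print(f"real fatals: {real}, ignored timeout fatals: {ignored}")
    sys.exit(1 if real else 0)  # fail the sanity check only on real fatals
```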

However, we would not like to ignore a genuine error in which the deployment itself causes delays in request processing. Our assumption is that in that case the errors would keep appearing and not subside, unlike the transient warnings we have observed.
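If that assumption holds, the check could compare timeout counts across two consecutive observation windows and only flag a problem when the rate is not subsiding. A rough sketch of that idea (the window length and the comparison rule are arbitrary assumptions):

```python
import os
import re
import time

TIMEOUT_PATTERN = re.compile(r"Fatal error: entire web request took longer than \d+ seconds")

def count_new_timeouts(path, offset):
    """Count timeout fatals appended to the log after byte `offset`; return (count, new_offset)."""
    with open(path, errors="replace") as f:
        f.seek(offset)
        chunk = f.read()
        new_offset = f.tell()
    count = sum(1 for line in chunk.splitlines() if TIMEOUT_PATTERN.search(line))
    return count, new_offset

def errors_are_subsiding(path, window_seconds=60):
    """Sample two consecutive windows; report whether the second shows fewer timeouts."""
    offset = os.path.getsize(path)  # only consider entries written from now on
    time.sleep(window_seconds)
    first, offset = count_new_timeouts(path, offset)
    time.sleep(window_seconds)
    second, _ = count_new_timeouts(path, offset)
    return second == 0 or second < first
```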

So far, we have introduced a delay between the Install step (where the file changes are actually made on the production servers) and the AfterInstall step (where the sanity checks run), but the delay we have chosen so far does not seem to be long enough. The problem is that increasing it further will inflate the overall deployment time, since we deploy in multiple batches, each targeting multiple production servers. We have also not yet explored the trend of these errors in enough detail.
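One way to avoid inflating every batch is to replace the fixed sleep with an adaptive wait: the AfterInstall hook polls the error log and proceeds as soon as no new timeout fatals have appeared for a quiet period, with a hard cap so a genuinely broken server still fails quickly. A sketch under those assumptions (the log path and all durations are placeholders):

```python
import os
import re
import time

TIMEOUT_PATTERN = re.compile(r"Fatal error: entire web request took longer than \d+ seconds")
LOG_PATH = "/var/log/hhvm/error.log"  # hypothetical path; use the real HipHop error log

def wait_until_quiet(quiet_seconds=30, max_wait_seconds=300, poll_interval=5):
    """Block until no new timeout fatals have appeared for `quiet_seconds`.
    Returns True if the log went quiet, False if `max_wait_seconds` expired first."""
    offset = os.path.getsize(LOG_PATH)
    last_error_time = time.time()
    deadline = time.time() + max_wait_seconds
    while time.time() < deadline:
        with open(LOG_PATH, errors="replace") as f:
            f.seek(offset)
            chunk = f.read()
            offset = f.tell()
        if TIMEOUT_PATTERN.search(chunk):
            last_error_time = time.time()
        elif time.time() - last_error_time >= quiet_seconds:
            return True
        time.sleep(poll_interval)
    return False
```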

We were also thinking about reporting the number of times the timeout occurs, and letting the person doing the deployment make the call on success based on that count. However, we still need to find out whether the Nagios logwarn utility we are using supports this. Currently, we use it to maintain a pointer into each log file marking where the last check ended, so that the next check scans from that pointer onwards.
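If logwarn turns out not to support reporting a count, the same pointer-based behaviour is simple to reproduce: persist the byte offset between runs, scan only the new portion of the log, and emit the number of timeout fatals for the deployer to judge. A minimal sketch (the pointer file location is an assumption):

```python
import os
import re
import sys

TIMEOUT_PATTERN = re.compile(r"Fatal error: entire web request took longer than \d+ seconds")
POINTER_FILE = "/var/tmp/hhvm_error_log.pointer"  # hypothetical pointer location

def count_since_last_check(log_path):
    """Scan the log from the saved pointer, count timeout fatals, then advance the pointer."""
    try:
        with open(POINTER_FILE) as f:
            offset = int(f.read().strip())
    except (OSError, ValueError):
        offset = 0
    if offset > os.path.getsize(log_path):  # log was rotated or truncated; start over
        offset = 0
    with open(log_path, errors="replace") as f:
        f.seek(offset)
        chunk = f.read()
        offset = f.tell()
    count = sum(1 for line in chunk.splitlines() if TIMEOUT_PATTERN.search(line))
    with open(POINTER_FILE, "w") as f:
        f.write(str(offset))
    return count

if __name__ == "__main__":
    print(f"timeout fatals since last check: {count_since_last_check(sys.argv[1])}")
```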

  • I integrated our Change Management system with our Nagios system, so that when a change record (deployment) is about to start, it automatically schedules downtime for the host and all of its services via the Nagios external command file. This downtime lasts for the length of time the change is actually scheduled. – Jim Black Mar 16 '17 at 16:08
  • I am not sure what criteria should make us consider scheduling downtime. What I realise is that the team is worried that taking around 50 of our 200 servers out of the load balancer at a time will put a lot of additional load on the remaining servers. We have not done any POC yet, I guess. – Sandeepan Nath Mar 16 '17 at 18:15
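For reference, scheduling downtime programmatically as Jim Black describes means writing SCHEDULE_HOST_DOWNTIME / SCHEDULE_HOST_SVC_DOWNTIME lines to the Nagios external command file. A rough sketch (the command file path, host name, and durations are assumptions for your install):

```python
import time

COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"  # typical default; adjust to your install

def schedule_downtime(host, duration_seconds, author="deploy-bot", comment="CodeDeploy batch"):
    """Schedule fixed downtime for a host and all of its services."""
    now = int(time.time())
    start, end = now, now + duration_seconds
    # Format per the Nagios external command reference:
    # SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
    lines = [
        f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};1;0;{duration_seconds};{author};{comment}",
        f"[{now}] SCHEDULE_HOST_SVC_DOWNTIME;{host};{start};{end};1;0;{duration_seconds};{author};{comment}",
    ]
    with open(COMMAND_FILE, "w") as f:  # the command file is a named pipe; plain writes work
        f.write("\n".join(lines) + "\n")

# e.g. schedule_downtime("web-prod-01", 15 * 60) before pulling the batch from the load balancer
```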
