We have set up an automated deployment pipeline for an Nginx server running HipHop PHP, using AWS CodeDeploy triggered by Jenkins.
We are currently facing a problem: while some core files are being deployed, requests to the server appear to get queued. When the system begins to recover, requests are processed with a delay and some of them time out. That is acceptable, and we would like to ignore those timeouts. However, each timeout registers a fatal error in the HipHop error log, and since our post-deployment sanity check looks for the word "Fatal", it catches these entries and marks the whole deployment as failed:
2017-03-08 11:25:17 - Error found -> [Wed Mar 8 11:24:31 2017] [hphp] [12082:7f238d3ff700:32:000001] []
\nFatal error: entire web request took longer than 10 seconds and timed out in /var/cake_1.2.0.6311-beta
app/webroot/openx/www/delivery/postGetAd.php on line 483
We can ignore this error in the HipHop error check status.
However, we do not want to ignore a genuine error, that is, one where the deployment itself has introduced a delay in request processing. We assume that in that case the errors would keep appearing rather than subside, unlike the false alarms we have observed so far.
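To make the distinction concrete, here is a minimal Python sketch of how the sanity check could separate request-timeout fatals from every other fatal. It assumes that the timeout entries always contain the phrase seen in the log excerpt above; the names `TIMEOUT_RE` and `classify_fatals` are hypothetical and not part of our current check.

```python
import re

# Assumption: every request-timeout fatal contains this phrase,
# as in the log excerpt quoted earlier in this post.
TIMEOUT_RE = re.compile(
    r"Fatal error: entire web request took longer than \d+ seconds and timed out"
)


def classify_fatals(lines):
    """Split lines containing 'Fatal' into request timeouts and all other fatals."""
    timeouts, other = [], []
    for line in lines:
        if "Fatal" not in line:
            continue  # not a fatal entry at all
        if TIMEOUT_RE.search(line):
            timeouts.append(line)
        else:
            other.append(line)
    return timeouts, other
```

With this split, any entry in `other` could still fail the deployment immediately, while entries in `timeouts` could be judged separately.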
So far, we have added a delay between the Install step (where the file changes are actually made on the production servers) and the AfterInstall step (where the sanity checks run). However, the delay we have chosen does not seem to be long enough. The problem is that increasing it further will increase the overall deployment time, since we deploy in multiple batches, each covering multiple production servers. We have not yet examined the trend of these errors in enough detail.
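One alternative to a fixed delay that we could try is to poll until the timeouts stop, with an upper bound on the wait. The Python sketch below is only an illustration under assumptions: `count_new_timeouts` is a hypothetical callable that returns how many timeout fatals appeared since the previous poll, and the default values for `quiet_seconds`, `max_wait` and `poll_interval` are placeholders, not tuned numbers.

```python
import time


def wait_until_quiet(count_new_timeouts, quiet_seconds=15, max_wait=120,
                     poll_interval=5, sleep=time.sleep, clock=time.monotonic):
    """Return True once no new timeouts are seen for quiet_seconds.

    Return False if timeouts are still appearing when max_wait has elapsed,
    which would suggest a persistent problem rather than a transient one.
    """
    start = clock()
    last_seen = start  # time of the most recent poll that saw a timeout
    while True:
        now = clock()
        if count_new_timeouts() > 0:
            last_seen = now
        elif now - last_seen >= quiet_seconds:
            return True
        if now - start >= max_wait:
            return False
        sleep(poll_interval)
```

Servers that recover quickly would then move on after roughly `quiet_seconds`, and only servers that keep producing timeouts would use the full `max_wait`.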
We were thinking of displaying the number of times the request timeout occurs and letting the person doing the deployment decide, based on that count, whether it succeeded. However, we still need to find out whether the Nagios logwarn utility we are using supports this. Currently, we use it to maintain a pointer into each log file marking where the last check ended, so that the next check scans from that pointer onwards.
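If logwarn turns out not to support counting, the same pointer idea could be reproduced in a small script. The following Python sketch is not logwarn and does not use any logwarn option; it simply stores a byte offset in a state file and counts fatals in whatever was appended since. The function name `scan_since_last_check`, the JSON state file and the timeout phrase are all assumptions made for illustration.

```python
import json
import os
import re

# Assumption: timeout fatals always contain this phrase (taken from the log excerpt).
TIMEOUT_RE = re.compile(r"entire web request took longer than \d+ seconds and timed out")


def scan_since_last_check(log_path, state_path):
    """Count fatals in the log lines added since the previous call."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = json.load(f).get("offset", 0)
    # If the log was rotated or truncated, start again from the beginning.
    if os.path.getsize(log_path) < offset:
        offset = 0
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()
    # Remember where this scan ended so the next one starts from there.
    with open(state_path, "w") as f:
        json.dump({"offset": new_offset}, f)
    timeouts = other = 0
    for line in data.decode("utf-8", errors="replace").splitlines():
        if "Fatal" not in line:
            continue
        if TIMEOUT_RE.search(line):
            timeouts += 1
        else:
            other += 1
    return {"timeout_fatals": timeouts, "other_fatals": other}
```

The returned counts could be printed in the deployment output so that the person deploying sees, for example, how many timeout fatals occurred and whether any other fatal was logged.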