
We have the following blue-green deployment design. The idea is to

  • deploy the latest code to the inactive cluster
  • run smoke tests
  • switch the VIP so the newly deployed cluster goes live and the previously active one becomes inactive

and we created the pipelines accordingly in GoCD. However, the issue is that we also want to deploy the latest code to the cluster that has just transitioned to the inactive state. How do we make sure this cluster doesn't become active again while that deployment is in progress? Or, how are others doing blue-green deployments? Google searches turn up solutions geared towards AWS, but we don't use AWS or any public cloud.
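For concreteness, this is roughly the kind of state tracking we think the pipeline needs; a minimal sketch, assuming a shared state file that both the pipelines and the VIP-switching scripts consult (the path and field names are hypothetical, not our actual setup):

    import json

    STATE_FILE = "/etc/deploy/cluster_state.json"  # hypothetical path

    def load_state():
        # State looks like: {"active": "blue", "deploy_in_progress": false}
        with open(STATE_FILE) as f:
            return json.load(f)

    def save_state(state):
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)

    def begin_deploy():
        state = load_state()
        if state["deploy_in_progress"]:
            raise RuntimeError("another deploy is already running")
        inactive = "green" if state["active"] == "blue" else "blue"
        # Mark a deploy as in progress so nothing (pipeline, keepalived
        # hook scripts) is allowed to switch the VIP while we deploy to
        # the inactive cluster.
        state["deploy_in_progress"] = True
        save_state(state)
        return inactive

    def finish_deploy(promote):
        state = load_state()
        if promote:
            # Swap roles only after smoke tests pass; the actual VIP
            # switch (e.g. a keepalived priority change) happens here.
            state["active"] = "green" if state["active"] == "blue" else "blue"
        state["deploy_in_progress"] = False
        save_state(state)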

EDIT 1

Infrastructural constraints: We have hardware available only for two clusters

What stops you from running the batch jobs in the live cluster?: The live cluster is serving production queries; the batch load would take up machine resources and might make the online system unresponsive

[Diagram: the blue-green deployment design]

Aravind Yarram

1 Answer


I'm not sure if this will help you, but in our setup we have a load balancer that the clients talk to. This LB knows which instances are live and which are dark, and forwards traffic accordingly. If the request has a 'special' header, the LB sends the traffic to the dark pool. We have this setup per application (I'm making this clear because, from the diagram you posted, some people might think the whole platform is blue-green).

A diagram of it would look like this, where the green cluster is live and the blue one is dark (<3 ASCII art):

           [Client]      <- I assume this is internal, otherwise add a FW :).
               |
              \|/
   [Application Load Balancer]  <- internal, per app
               |
               |\--------------\--------------\--------------\
              \|/             \|/            \|/            \|/
         [Node 1 G/L]    [Node 2 G/L]    [Node 3 B/D]  [Node 4 B/D]


G = Green  B = Blue
L = Live   D = Dark 

The application load balancer can be built with a number of technologies. It could be a gateway app (like Netflix Zuul) or a load-balancing web server (like Airbnb SmartStack, which uses HAProxy).
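Just to illustrate the routing rule, here's a minimal sketch of the decision the LB makes; the header name and node addresses are made up:

    import random

    LIVE_POOL = ["node1:8080", "node2:8080"]   # green / live
    DARK_POOL = ["node3:8080", "node4:8080"]   # blue / dark

    def pick_backend(headers):
        # Requests carrying the 'special' header go to the dark pool;
        # everything else is balanced across the live pool.
        if headers.get("X-Dark-Traffic") == "true":  # hypothetical header name
            return random.choice(DARK_POOL)
        return random.choice(LIVE_POOL)

    print(pick_backend({"X-Dark-Traffic": "true"}))  # -> a dark node
    print(pick_backend({}))                          # -> a live node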

It's worth mentioning that if the live cluster goes up in flames, we don't automatically promote the dark cluster to live... What I'm trying to say is that we don't use blue/green as an alternative to high availability. Is this your concern? (I ask because you're using VIPs and keepalived here.)

Edit

Thanks for the answers to the questions. Unfortunately, I don't think you'll be able to do blue-green successfully with your constraints.

Have you considered having just one big environment and doing some sort of hybrid between a canary release and blue-green? With this approach, you initially have 5 servers serving live traffic and 1 serving dark traffic (I assume you have 6 boxes in total). The live nodes could be configured so that 3 take live traffic and 2 do the batch processing.

When you're happy with the code in the dark pool, you start upgrading the servers one by one until all of them are serving live traffic in the live pool. At that point, you might need to move the 2 batch-processing servers to the live pool as well, unless you have a way of moving them more gradually (probably one job at a time?).
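A rough sketch of that rolling upgrade, assuming you have helpers to drain, upgrade, and smoke-test a node (everything here, pools and helpers alike, is a hypothetical placeholder):

    # All pool contents and helper bodies below are placeholders.
    LIVE_POOL = ["n1", "n2", "n3"]   # serving live traffic
    BATCH_POOL = ["n4", "n5"]        # running the batch jobs
    UPGRADED = ["n6"]                # started as the single dark node

    def drain(node):
        print(f"draining {node}")       # remove from LB, wait for in-flight work

    def upgrade(node):
        print(f"upgrading {node}")      # deploy the new version

    def smoke_test(node):
        print(f"smoke testing {node}")  # hit health/smoke endpoints
        return True

    # Once the dark node looks good, fold the remaining nodes into the
    # upgraded pool one at a time, keeping enough capacity for live traffic.
    for node in LIVE_POOL + BATCH_POOL:
        drain(node)
        upgrade(node)
        if not smoke_test(node):
            raise RuntimeError(f"{node} failed smoke tests; stop and roll back")
        UPGRADED.append(node)
        print(f"{node} promoted; upgraded pool is now {UPGRADED}")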

Just in case, I want to make something really clear, as this might come back to bite you (and I don't like fellow developers to be in pain). If your batch processing is a fundamental part of your platform, you don't have a true HA environment, for the reason I outlined in my original answer: if your live cluster fails for any reason (DB corruption?), you won't be able to run everything on the remaining hardware.

Augusto
  • I think the problem for us (due to infrastructural constraints) is that the blue/green setup is trying to solve 2 problems: 1. high availability, 2. running batch jobs against the dark cluster. Since we are using it for multiple purposes, we need to keep both clusters up to date. – Aravind Yarram Sep 27 '15 at 02:02
  • My intuition was correct :S... And that's 3 concerns (CD, HA and offloading work). There's some risk in mixing HA with offloading work: if the live cluster goes down, are you going to handle the live traffic + batch processing on one of the clusters? A few questions (feel free to update your question). **1)** Can you enumerate the infrastructure constraints? **2)** What stops you from running the batch jobs in the live cluster? – Augusto Sep 27 '15 at 16:38
  • Augusto - thanks for taking the time to respond. It seems like any approach we take right now doesn't solve all the issues. – Aravind Yarram Sep 29 '15 at 00:04