
I am migrating a Grails (2.4.4) web application (Java based) to a Docker Swarm environment (3 nodes), with a MariaDB database outside the Swarm.

In its current version, the application is hosted on a single server and instantiates a few threads (~10), each of which has a specific task to do, but all of which manipulate data from the database.

Now that the application will be replicated on the 3 nodes of the Swarm, I don't think there is a point in instantiating all the threads on each node, because they would do the same thing on the same data (located on the database machine outside the Swarm), and it probably won't work because of concurrent access and MySQL transactions.

So, considering the fact that the threads cannot be redeveloped outside the application's source code, as they rely on its model, my question is: what do you think would be the best solution for this use case? I personally thought about two options, but I don't feel like I'm going in the right direction:

  1. Synchronize the threads at the beginning of their process: only the first thread of its kind would actually do the job, and the 2 others would put themselves back to sleep. I would do that with something like a lock mechanism in the database.
  2. Only instantiate the threads on one node: I think it should be possible, but I am really unsure about this one, as it would contradict the core principle and advantages of having a replicated & scalable application.
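For option 1, a minimal sketch of the "first thread wins" pattern might look like the following. Note the class and method names are illustrative, and an in-memory `AtomicBoolean` stands in for the shared lock; across real Swarm nodes the lock would have to live in MariaDB instead, e.g. via `SELECT GET_LOCK('my-task', 0)`, which returns 1 for exactly one session:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class FirstWinsTask {
    // In-memory stand-in for a shared lock. Across real Swarm nodes this
    // would be a database-level lock instead, e.g. MariaDB's
    //   SELECT GET_LOCK('my-task', 0)   -- returns 1 for exactly one session
    private final AtomicBoolean lock = new AtomicBoolean(false);

    /** Runs the job only if this caller was the first to grab the lock. */
    public boolean runIfFirst(Runnable job) {
        if (!lock.compareAndSet(false, true)) {
            return false;          // lock already taken: go back to sleep
        }
        try {
            job.run();             // we are "the first thread of its kind"
            return true;
        } finally {
            lock.set(false);       // SELECT RELEASE_LOCK('my-task') in MariaDB
        }
    }

    /** Demo: three racing workers; exactly one should end up doing the job. */
    public static int raceThreeWorkers() throws InterruptedException {
        FirstWinsTask task = new FirstWinsTask();
        CountDownLatch hold = new CountDownLatch(1);   // keeps the winner busy
        AtomicInteger winners = new AtomicInteger();
        Thread[] workers = new Thread[3];
        for (int i = 0; i < 3; i++) {
            workers[i] = new Thread(() -> {
                boolean won = task.runIfFirst(() -> {
                    try { hold.await(); } catch (InterruptedException ignored) { }
                });
                if (won) winners.incrementAndGet();
            });
            workers[i].start();
        }
        Thread.sleep(200);     // give all three time to attempt the lock
        hold.countDown();      // let the winner finish its job
        for (Thread w : workers) w.join();
        return winners.get();
    }
}
```

With a real database lock the pattern is the same: the losers check the return value of `GET_LOCK` and go back to sleep instead of doing the work.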

So, happy to hear any advice on the matter! Thanks

Alarid
    Deploy each task as its own service and let Swarm do the scaling, or use a distributed cron (e.g. Airflow) – cfrick Jan 30 '20 at 11:41

3 Answers


There are many possible designs here; they all pretty much depend on what you're actually trying to achieve.

You say:

In its current version, the application is hosted on a single server and instantiates a few threads (~10), each of which has a specific task to do

Let's suppose you go with option 2 and all the threads are running on one node out of three.

This is a kind of active-passive architecture: one node works, the other nodes do basically nothing (well, maybe they do other things; that's out of scope given the information you've provided). So these nodes are maintained for redundancy?

But if so, the working node will "absorb" all the load until it fails; then all the load shifts to the node that becomes active, but maybe that's too much load and it will fail too, and then the third node falls like a domino :)

One additional issue with this approach is how to actually make node2 active when node1 has failed. Who will decide that node2 (as opposed to node3) is active now? How do you spawn the threads if the configuration "run-without-threads" is already specified?

If you can answer these questions, do not expect high pressure on a single node, and agree to maintain nodes for redundancy, then it can be a way to go; many systems are built this way, so it's a viable solution.

Another solution is to fully scale out with an active-active architecture, so that one part of the tasks is taken by node1 and the other parts are handled by node2 and node3.

Here there are many possible options. You say

each of which has a specific task to do

Who actually triggers this task for execution? Is it some scheduled job that runs once in a while and submits a task for execution? Or maybe the task is spawned as a result of some request that comes from a client (like an HTTP call)?

Another question is whether there are tasks that basically should not overlap, or whether potentially every task can corrupt the execution of another task.

If there is a way to separate the tasks, then you could send a message to some queue when a new task arrives. Based on a partition id (if you use something like Kafka), a routing key (in RabbitMQ), or any other way of clustering, you could build an architecture where tasks are grouped together by type: one specific server would take care of executing one whole group of tasks, and another group of tasks would be executed by another server.

If a server goes down, then the group of tasks previously handled by the failed server will be re-assigned to another server (the technical details vary depending on the solution).
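One way to make the group-to-server assignment both deterministic and stable under failures (not prescribed by the answer, just one possible technique) is rendezvous, i.e. highest-random-weight, hashing. Every node can compute the same assignment independently, and when a node disappears only the groups it owned get re-assigned; everything else stays put. A self-contained sketch, with an illustrative hash function:

```java
import java.util.List;

public class TaskPartitioner {

    /** Deterministic pseudo-random weight for a (group, node) pair. */
    private static long weight(String group, String node) {
        long h = 1125899906842597L;                 // arbitrary large prime seed
        for (char c : (group + "|" + node).toCharArray()) {
            h = 31 * h + c;                         // simple polynomial hash
        }
        h ^= (h >>> 33);                            // final avalanche mixing
        h *= 0xff51afd7ed558ccdL;
        h ^= (h >>> 33);
        return h;
    }

    /** The live node with the highest weight for this group owns it. */
    public static String ownerOf(String group, List<String> liveNodes) {
        String best = null;
        long bestWeight = Long.MIN_VALUE;
        for (String node : liveNodes) {
            long w = weight(group, node);
            if (w > bestWeight) {
                bestWeight = w;
                best = node;
            }
        }
        return best;
    }
}
```

Because the owner is the argmax over the live node set, removing any node other than the owner cannot change a group's assignment, which is exactly the re-assignment behavior described above.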

Mark Bramnik
  • Thanks for your detailed answer. I cannot answer every question without going into the details of the application's context, and that would be too long, but we basically chose to externalize the triggering of the tasks outside the app, on another service in the swarm, pretty much like a cron which sends HTTP requests to a load balancer, which then chooses an instance of the application to actually do the work. – Alarid Feb 10 '20 at 10:35

The easiest option (2. of those mentioned above) would be to parametrize the task run, so that through configuration you activate the tasks on one instance and disable them on all others. Although this approach seems simple and reasonable, you would run into problems when you have to scale your swarm up and down.

What would happen if the instance with the tasks activated gets killed? What migration is needed? All in all, it's a big problem zone.

Another option would be to factor the thread code out to a "worker-style" app.

You would have to redesign and modularize your app, so that the tasks can run outside your original Grails app. In this case you can scale your main app and worker app freely and independently.

The worker app doesn't have to be based on Grails, and you can pick any other framework with native Groovy support, like Vert.x (recommended), Micronaut or Ratpack.

injecteer
  • The thing is, the application is deployed automatically using GitLab CI/CD, so I don't think the option to parametrize the task run manually on one instance is actually an option. It would have to be parametrized automatically on one of the 3 instances, but even so, I don't think it is a good idea. Your suggestion to redesign and modularize the app is good; unfortunately it might cost too much to do that for our customer – Alarid Jan 29 '20 at 14:34

One option is to change your application's logic to start fewer threads per instance: instead of one instance with 10 threads, start 5 instances with 2 threads each and ask Swarm to scale it for you.

Another option is to connect the instances into one "cluster" and use some mechanism to elect a leader, starting all the threads only on the leader node. Then, if the leader goes down, you need to re-elect a leader and restart the tasks on the new leader.
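One common way to implement such an election without extra infrastructure is a lease: each instance periodically tries to claim or renew a named lease that expires after a timeout, and whoever holds it runs the threads. The sketch below is illustrative (names are made up, and a `ConcurrentHashMap` stands in for the shared store); in production the lease row would live in the shared MariaDB database, or you would use a purpose-built coordinator such as ZooKeeper or Consul:

```java
import java.util.concurrent.ConcurrentHashMap;

public class LeaseElection {
    static final class Lease {
        final String owner;
        final long expiresAt;
        Lease(String owner, long expiresAt) {
            this.owner = owner;
            this.expiresAt = expiresAt;
        }
    }

    private final ConcurrentHashMap<String, Lease> store; // stand-in for a shared DB table
    private final long leaseMillis;                        // how long a claim stays valid

    public LeaseElection(ConcurrentHashMap<String, Lease> sharedStore, long leaseMillis) {
        this.store = sharedStore;
        this.leaseMillis = leaseMillis;
    }

    /** Returns true if this instance is the leader and may run the threads. */
    public boolean tryAcquire(String leaseName, String myId, long now) {
        Lease result = store.compute(leaseName, (name, current) -> {
            boolean free = current == null || current.expiresAt <= now;
            if (free || current.owner.equals(myId)) {
                return new Lease(myId, now + leaseMillis); // claim, or renew our own lease
            }
            return current; // someone else is leader and their lease is still valid
        });
        return result.owner.equals(myId);
    }
}
```

Each instance would call `tryAcquire` on a schedule shorter than the lease duration: the leader keeps renewing, and if it crashes and stops renewing, the lease expires and another instance takes over, which gives you exactly the re-election behavior described above.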

Ivan