Immortal application

Question

(TL;DR: last paragraph) I'm running an online service and have thus far been doing offline backups and simple monitoring to achieve resilience and availability.

Resilience is rather manual but I'm fairly confident the data will survive. I worry slightly that the data is less secure because I have to actively make backups of it. The site has been down on several occasions for half days because I needed time to respond and once multiple days because of UPS & network failure.

I don't like that.

I've been looking at server-clustering, XEN based, solutions as well as PaaS solutions. I find that no PaaS could ever provide my required level of security. I'm considering splitting into low-sec and high-sec operations but that will only move my hosting problem.

I do not require extreme scalability (yet, I hope :) or perfect uptime, but I'd naturally like them. Halting for minutes is acceptable. Losing active memory sucks. Losing disk data is unacceptable. Breach of security (publicized data) is unacceptable. I only care for the single application to survive, not about cron jobs or the OS it runs on (as long as it's paranoidly secure, prefer OpenBSD).

The question: How do I run an application (Linux and BSD compatible) in a way it will never die on a cluster of servers?

edit: In response to your requests for clarity: It is a web service for secure storage of private keys, meaning an API that is accessible over the Internet and performs private key operations after having been cleared. The private keys are valuable and must not be lost. These keys are synced to disk, so preserving memory is not required. By immortal I mean that it may be suspended, but it must be able to continue after that suspension. Kernel upgrades will not be a significant problem because planned downtime is allowed. This is beginning to look like a replicated disk and automatic failover problem.

It would probably help if you were more explicit about what "online service" means (i.e. web site, web service, other?). It would also help if you gave more detail about the current stack your application runs on. — Ryan Bolger, Jul 18 '12 at 14:41
Well, I've seen well-written Unix apps run for years without a reboot. However, if a (security) kernel patch arrives, you can't do without a reboot. You may look at Ksplice, which patches the running kernel in place so that applications survive. Power, network and hardware failures can be reduced by choosing a provider that offers redundancy and clustering at all levels. Xen allows for seamless migration of the OS and apps to new hardware. This might be what you're looking for. Does your application have state? Where is this state stored? DB? Memory? Or is it stateless? — The Unix Janitor, Jul 18 '12 at 14:46
More information is needed! How does the service work? You talk about disk data - what kind of disk data? Plain files, a database... what? — Jenny D, Jul 18 '12 at 15:03
The disk data is plain files. I might make it into a database but it'd be a lot of effort. The application is stateful but checkpointed. State is either stored on disk or may be lost. I want it to survive failure of the machine it's running on, or of the network it is connected to. Being booted elsewhere with synced disks may be acceptable. — Lodewijk, Jul 18 '12 at 15:18

score 4 · Answer 1 · answered Jul 18 '12 at 16:16

You are asking a lot of questions, all wrapped up in one big ball. I suspect you don't even know you're asking some of them.
I've tried to pick out the important items and offer you some guidance.

I worry slightly that the data is less secure because I have to actively make backups of it.

Encrypt your backups. Any backup software worth using (e.g. Bacula) will support this.

The site has been down on several occasions for half days because I needed time to respond and once multiple days because of UPS & network failure.

I don't like that.

Nobody does, but it happens. If you want to avoid this you need distributed redundant copies of your site, preferably parallel redundancy (where requests are going to all of the copies all of the time, and data is magically synchronized between them).
Think Google, because that's the kind of budget we're talking about here. In the uptime game, Nines cost Dollars.

I've been looking at server-clustering, XEN based, solutions as well as PaaS solutions. I find that no PaaS could ever provide my required level of security. I'm considering splitting into low-sec and high-sec operations but that will only move my hosting problem.

It sounds like you're looking at the wrong solutions, because if "I do not require extreme scalability (yet, I hope :) or perfect uptime, but I'd naturally like them." is true, the most economical solution would be to find a different datacenter (with better infrastructure and a stricter SLA).
You're chasing things designed for fast in order to achieve reliable -- the two are not mutually exclusive (in fact they're somewhat symbiotic), but they're also not conjoined twins.

Halting for minutes is acceptable. Losing active memory sucks. Losing disk data is unacceptable. Breach of security (publicized data) is unacceptable. I only care for the single application to survive, not about cron jobs or the OS it runs on (as long as it's paranoidly secure, prefer OpenBSD).

OK, Halting for minutes is acceptable means you're being reasonable. That's good. We like reasonable people around here.

Losing active memory sucks - I agree with you there Skippy, I just don't think you have a cause of action.
Servers crash. It happens even in the best-maintained environments, and when a server reboots or loses power active memory (RAM) is gone. Not much you can do about that - that's what's supposed to happen.

Losing disk data is unacceptable - Aw man, now you're being unreasonable.
In the real world disks fail. When that happens they take all their data with them, and you lose everything done since the last backup. This is why we make backups (often enough that we won't lose too much important data).
Since you're already making backups you're doing everything you can to mitigate this, so when the inevitable disk failure (or OS crash and corruption) happens my advice is to punch a kitten in the face and start your restore process.
(You do have a restore process, and you test it regularly, right? :-)
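As a rough illustration of what "testing the restore" could mean for plain files, here is a minimal sketch that compares a restored directory tree against the live one by SHA-256 checksum. The function names (`file_digest`, `verify_restore`) and the directory layout are assumptions made up for this example, not part of any backup tool; a real test would restore from the actual backup medium first.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """List (relative_path, problem) pairs for files that did not restore intact.

    Hypothetical helper: checks every file under source_dir against the
    file at the same relative path under restored_dir.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append((str(relative), "missing"))
        elif file_digest(source_file) != file_digest(restored_file):
            problems.append((str(relative), "differs"))
    return problems
```

An empty list means every file that exists on the live side came back byte-for-byte identical. Files changed since the backup was taken will show up as "differs", so this is best run against a quiesced copy or a snapshot.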

Breach of security (publicized data) is unacceptable -- I'm just going to say "Well DUH" to that and move on. I can't think of any service where having data exfiltrated is considered "acceptable".

[I don't care] about . . . the OS . . . (as long as it's paranoidly secure, prefer OpenBSD) -- Security is NOT a function of the operating system, it's a function of the configuration you apply on top of it. I can make an insecure OpenBSD machine in about 5 seconds.
Forget all the marketing hype, and forget OpenBSD's (admittedly impressive) track record: Pick the OS that meets your needs, and then spend time securing it. Yes you still have to do that for an OpenBSD box too.

The question: How do I run an application (Linux and BSD compatible) in a way it will never die on a cluster of servers?

You don't. The best you can do on most clusters (or single systems for that matter) is monitor for app failures and restart it quickly enough that your users don't notice.
The closest you're going to come to the kind of thing you are describing is setting up something like VMware HA (across geographically distributed sites if network or datacenter power issues are a real concern), and failing the whole (virtual) environment over if one site goes down.
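The "monitor for app failures and restart it" part can be sketched in a few lines. This is only a minimal illustration, assuming the application runs as a single foreground process and signals failure with a non-zero exit status; the `supervise` function and its parameters are invented for this example. In practice you would more likely use an existing process supervisor than write your own.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=5, backoff_seconds=1.0):
    """Run command, restarting it whenever it exits with a non-zero status.

    Returns the number of restarts performed once the command exits cleanly.
    Raises RuntimeError if it still fails after max_restarts restarts.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts")
        restarts += 1
        # Pause briefly so a crash loop does not spin the CPU.
        time.sleep(backoff_seconds)


if __name__ == "__main__":
    # Hypothetical usage: pass the application's command line as arguments.
    supervise(sys.argv[1:])
```

Note that this only covers the application dying on a machine that is still up. It does nothing for a dead machine, a dead network, or a dead datacenter, which is where the failover setup comes in.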

edit: In response to your requests for clarity: It is a web service for secure storage of private keys, meaning an API that is accessible over the Internet and performs private key operations after having been cleared.

I hope you don't take this the wrong way, but you can have my private keys when you pry them from my cold, dead, lifeless hands. Anyone who doesn't share that philosophy is insufficiently paranoid about data security. :-)

I hate it when people answer my questions with the considerations I've already made myself. I guess it's quite correct. And I've made such a mess of my question that I can't really expect a better answer. I'll just remain disciplined about backups and look into distributed storage a bit. — Lodewijk, Jul 18 '12 at 18:47
@Lodewijk The answer to your question is "You can't" -- The system won't let me post that as an answer because it's too short, so you get the benefit of a detailed analysis of the issues you raised and what you can do to approximate what you're trying to achieve. Of course if you stop reading halfway through the analysis you don't get to the answer, but I read and parsed your whole question so I have a reasonable expectation that you will read and parse my whole answer :-) — voretaq7, Jul 18 '12 at 18:56

score 1 · Accepted Answer · answered Jul 18 '12 at 18:39

For availability, you need a second server somewhere. If your own location is not good enough, the next best thing is to buy a server and host it at a colocation datacenter. Less secure than that is renting a dedicated server (e.g. from Rackspace), and less secure still is subscribing to a VPS. I guess you don't need to go lower than that, as VPSs are cheap (Amazon EC2 has a free tier for the first year).

From how you describe your availability requirements, it looks like just adding a single VPS to your existing server will be enough.

If your single server is enough for high-sec operations, you can run the low-sec part on a VPS. If your server alone is not enough for high-sec, you have nothing to split.

It's not hard to have an app that will "never die" on two servers: just sync all your data in real time between the servers using some clustered database (I've heard of Cassandra, but there are a LOT of clustered databases that will fit). If it MUST be files on a filesystem, there's DRBD, but I'd advise trying to go with a database anyway to avoid complications. What about a two-column table: 1. filename-and-path and 2. contents? Then you replace all your file saves with store-to-DB and your file reads with get-from-DB.
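To make the two-column idea concrete, here is a minimal sketch of the store-to-DB and get-from-DB calls. It uses SQLite purely as a local stand-in so the example is runnable on its own. SQLite is NOT replicated, so in a real setup the same two functions would sit on top of whatever clustered database you pick. The table name `files` and the function names are assumptions made up for this example.

```python
import sqlite3


def open_store(db_path):
    """Open (and if needed create) the two-column path/contents table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY,"
        " contents BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def store_to_db(conn, path, contents):
    """Replacement for writing a file: upsert the bytes under the given path."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT OR REPLACE INTO files (path, contents) VALUES (?, ?)",
            (path, contents),
        )


def get_from_db(conn, path):
    """Replacement for reading a file: return the bytes, or None if absent."""
    row = conn.execute(
        "SELECT contents FROM files WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```

The application code then only changes at the points where it opens files for reading or writing. How writes are acknowledged and replicated is entirely up to the database underneath, which is the part that matters for not losing keys.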

That's basically all. There's nothing very complicated or expensive in what you are trying to achieve.

Disclaimer: You decide which level of security you need. I did not advise storing your sensitive data on hosted servers, nor did I advise otherwise.

Quite frankly, this is what I decided to do. It's the hard way though, to just do it clustered. I'd like it if there were a virtualization/clustering solution like OpenStack that would handle this for you. No magic, just hard work. — Lodewijk, Jul 20 '12 at 10:38
I don't think it's __hard__ work: installing two servers is just like installing one, and you choose a database that will be transparently and magically synchronised across your machines (NOT MySQL replication). — Sandman4, Jul 20 '12 at 15:04
Btw, for your uptime requirements, just a __single__ server hosted with any decent provider may be enough - for example we have VPSes at Linode and EC2 up for one year, and the only downtime was ~1 minute for a single planned reboot. — Sandman4, Jul 20 '12 at 15:10
