You are asking a lot of questions, all wrapped up in one big ball. I suspect you don't even know you're asking some of them.
I've tried to pick out the important items and offer you some guidance.
I worry slightly that the data is less secure because I have to actively make backups of it.
Encrypt your backups. Any backup software worth using (e.g. Bacula) will support this.
The site has been down on several occasions for half days because I needed time to respond and once multiple days because of UPS & network failure.
I don't like that.
Nobody does, but it happens. If you want to avoid this you need distributed redundant copies of your site, preferably with parallel redundancy (where requests are going to all of the copies all of the time, and data is magically synchronized between them).
Think Google, because that's the kind of budget we're talking about here. In the uptime game, Nines cost Dollars.
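To put rough numbers on "Nines cost Dollars", here is a minimal sketch of the yearly downtime budget each availability target leaves you. The function name and the list of targets are my own illustration, and it assumes a 365-day year.

```python
# Yearly downtime budget for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return how many minutes per year a service may be down at this target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_minutes_per_year(target)
        print(f"{target}% uptime -> {budget:.1f} minutes of downtime per year")
```

Three nines leaves under nine hours a year, so a single half-day outage already blows that budget. Each extra nine cuts the allowance by a factor of ten, and that is where the money goes.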
I've been looking at server-clustering, XEN based, solutions as well as PaaS solutions. I find that no PaaS could ever provide my required level of security. I'm considering splitting into low-sec and high-sec operations but that will only move my hosting problem.
It sounds like you're looking at the wrong solutions, because if "I do not require extreme scalability (yet, I hope :) or perfect uptime, but I'd naturally like them" is true, the most economical solution would be to find a different datacenter (with better infrastructure and a stricter SLA).
You're chasing things designed for fast in order to achieve reliable -- the two are not mutually exclusive (in fact they're somewhat symbiotic), but they're also not conjoined twins.
Halting for minutes is acceptable. Losing active memory sucks. Losing disk data is unacceptable. Breach of security (publicized data) is unacceptable. I only care for the single application to survive, not about cron jobs or the OS it runs on (as long as it's paranoidly secure, prefer OpenBSD).
OK, "Halting for minutes is acceptable" means you're being reasonable. That's good. We like reasonable people around here.
"Losing active memory sucks" - I agree with you there, Skippy, I just don't think you have a cause of action.
Servers crash. It happens even in the best-maintained environments, and when a server reboots or loses power, active memory (RAM) is gone. Not much you can do about that - that's what's supposed to happen.
"Losing disk data is unacceptable" - Aw man, now you're being unreasonable.
In the real world disks fail. When that happens they take all their data with them, and you lose everything done since the last backup. This is why we make backups (often enough that we won't lose too much important data).
Since you're already making backups you're doing everything you can to mitigate this, so when the inevitable disk failure (or OS crash and corruption) happens my advice is to punch a kitten in the face and start your restore process.
(You do have a restore process, and you test it regularly, right? :-)
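One way to test a restore is to record checksums when the backup is taken and compare them after restoring to a scratch location. This is only a sketch: the function names are made up for illustration, and it assumes the backup is a plain directory tree of files rather than whatever format your backup software actually uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files don't have to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 at backup time."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Return the relative paths that are missing or differ after a restore."""
    problems = []
    for rel_path, expected in manifest.items():
        candidate = restored_root / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(rel_path)
    return problems
```

Build the manifest when you take the backup, restore to a scratch directory on a schedule, and treat a non-empty result from `verify_restore` as a failed backup.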
"Breach of security (publicized data) is unacceptable" -- I'm just going to say "Well DUH" to that and move on. I can't think of any service where having data exfiltrated is considered "acceptable".
"[I don't care] about . . . the OS . . . (as long as it's paranoidly secure, prefer OpenBSD)" -- Security is NOT a function of the operating system; it's a function of the configuration you apply on top of it. I can make an insecure OpenBSD machine in about 5 seconds.
Forget all the marketing hype, and forget OpenBSD's (admittedly impressive) track record: Pick the OS that meets your needs, and then spend time securing it. Yes, you have to do that for an OpenBSD box too.
The question: How do I run an application (Linux and BSD compatible) in a way it will never die on a cluster of servers?
You don't. The best you can do on most clusters (or single systems for that matter) is monitor for app failures and restart the app quickly enough that your users don't notice.
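The monitor-and-restart idea boils down to a loop like the one below. This is a bare-bones sketch with a hypothetical `supervise` function and made-up defaults. In practice you would hand this job to an existing process supervisor (systemd, daemontools, supervisord and friends) rather than write your own.

```python
import subprocess
import time


def supervise(command: list[str], max_restarts: int = 5,
              backoff_seconds: float = 1.0) -> int:
    """Run command, restarting it whenever it exits with a non-zero status.

    Returns the number of restarts performed once the command exits cleanly.
    Raises RuntimeError if it keeps failing after max_restarts restarts.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(command).returncode
        if exit_code == 0:
            return restarts  # clean exit: nothing left to supervise
        if restarts >= max_restarts:
            raise RuntimeError(
                f"{command!r} failed {restarts + 1} times; giving up")
        restarts += 1
        time.sleep(backoff_seconds)  # brief pause so a crash loop doesn't spin
```

The restart cap matters: an app that dies immediately on every start needs a human, not an infinite loop.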
The closest you're going to come to the kind of thing you are describing is setting up something like VMware HA (across geographically distributed sites if network or datacenter (power) issues are a real concern), and failing the whole (virtual) environment over if one site goes down.
edit: In response to your requests to clarity: It is a web service for secure storage of private keys, meaning an API that is accessible over the Internet and performs private key operations after having been cleared.
I hope you don't take this the wrong way, but you can have my private keys when you pry them from my cold, dead, lifeless hands. Anyone who doesn't share that philosophy is insufficiently paranoid about data security. :-)