2

For example, there is a repo for doing this in Django: https://sourcegraph.com/github.com/dcwatson/django-pgcrypto.

There is some discussion in the SQLAlchemy manual, but I am using non-byte columns: http://docs.sqlalchemy.org/en/rel_0_9/core/types.html

I am running Flask on Heroku using SQLAlchemy.

A code example and/or some discussion would be most appreciated.

Soferio
  • 483
  • 6
  • 14
  • 1
    Whenever you're considering crypto, you can't separate the "how" from the "why". What's your goal? What're you trying to protect and *from whom*? – Craig Ringer Jan 20 '14 at 03:40
  • The primary purpose is to have a basic degree of encryption protection over those columns in the event a database backup is 'stolen' or the database files themselves are 'stolen' from the server. I hope this can be achieved without too much code complexity and while avoiding the risk of corrupting data inadvertently by erroneous encryption processes. – Soferio Jan 20 '14 at 10:34

1 Answers1

8

There are a bunch of stages to this kind of decision making, it's not just "shove a plugin into the stack and that encryption thing is taken care of"

First, you really need to classify each column for its attractiveness to attackers & what searches/queries need to use it, whether it's a join column / index candidate, etc. Some data needs much stronger protection than other data.

Consider who you're trying to protect against:

  • Casual attacker (e.g SQL injection holes used for remote table copies)
  • Stolen database backup (tip: Encrypt these too)
  • Stolen/leaked log files, possibly including queries and parameters
  • Attacker with direct non-superuser SQL level access
  • Attacker with direct superuser SQL-level access
  • Attacker who gains direct access to the "postgres" OS user, so they can modify configuration, copy/edit logs, install malicious extensions, alter function definitions, etc
  • Attacker who gains root on the DB server

Of course, there's also the app server, upstream compromise of trusted sources for programming languages and toolkits, etc. Eventually you reach a point where you have to say "I can't realistically defend against this". You can't protect against somebody coming in, saying "I'm from the Government and I'll do x/y/z to you unless you allow me to install a rootkit on this customer's server". The point is that you've got to decide what you do have to protect against, and make your security decisions based on that.

A good compromise can be to do as much of the crypto as possible in the app, so PostgreSQL never sees the encryption/decryption keys. Use one-way hashing whenever possible, rather than using reversible encryption, and when you hash, properly salt your hashes.

That means pgcrypto doesn't actually do you much good, because you're never sending plaintext to the server, and you're not sending key material to the server either.

It also means that two people with the same plaintext for column SecretValue have totally different values for SecretValueSalt, SecretValueHashedBytes in the database. So you can't join on it, use it in a WHERE clause usefully, index it usefully, etc.

For that reason, you'll often compromise with security. You might do an unsalted hash of part of the datum, so you get a partial match, then fetch all the results to your application and filter them on the application side where you have the full information required. So your storage for SecretValue now looks like SecretValueFirst10DigitsUnsaltedHash, SecretValueHashSalt, SecretValueHashBytes. But with better column names.

If in doubt, just don't send plaintext of anything sensitive to the database. That means pgcrypto isn't much use to you, and you'll be doing mostly application-side crypto. The #1 reason for that is that if you send plaintext (or worse, key material) to the DB, it might get exposed in log files, pg_stat_activity, etc.

You'll pretty much always want to store encrypted data in bytea columns. If you really insist you can hex- or base64 encode it and shove it in a text column, but developers and DBAs who have to use your system later will cry.

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • Thank you for your thoughtful answer. I am persuaded by your argument that logs etc mean that crypto probably belongs in the app. What this may mean in my case is shifting all the important text columns to a byte type and encrypting on the app (but keeping uninformative numerical join keys and row ids untouched). If one goes down the road you have sketched out (encryption by app rather than by dbase) are there standard approaches to preventing scrambling of the data by a bug in the encryption process (e.g. inadvertent temporary use of the wrong key)? – Soferio Jan 21 '14 at 00:32
  • @user1642561 There may be application frameworks to help with that, but I'm not aware of anything that integrates into the tools you're using. Sorry. Might be worth posting a separate question about that. – Craig Ringer Jan 21 '14 at 02:31
  • Can I ask you to clarify why bytea should be used rather than string or text column type? – Soferio Jan 23 '14 at 13:34
  • @user1642561 Two reasons: 1. Encrypted data is not, conceptually, a text string. It's just a byte sequence - it has no text encoding, it isn't utf-8 or latin-1, it represents no characters. So for clean modelling a byte type should be used. 2. Efficiency: If you shove it in a `text` field you have to encode those bytes, because Pg's `text` fields can't store `null` bytes and not all byte sequences are valid in all text encodings. You have to `base64` or `hex` or whatever encode it. Then you're wasting storage (bytea is stored 1:1 on disk as plain bytes) *and* making it conceptually wrong. – Craig Ringer Jan 23 '14 at 14:08