29

I'm building an application that needs to store sensitive information, which means the data is encrypted on my database so that a hacker/employee with access to the database cannot decipher the sensitive data. However, it still needs to be searchable (on a certain level).

I understand certain compromises may have to be made. For example, I'm willing to leave some data attributes unencrypted to make them indexable if necessary, but "the main body" must be encrypted.

What are some best practices and approaches for storing sensitive data that needs to be viewable, searchable, and/or sortable by authorized people?

(I was thinking of extracting non stop words from the "body" and putting them in random order in a field before encrypting the body, and then feed that field to a search indexer, I doubt it provides any real security.)

Pacerier
  • 86,231
  • 106
  • 366
  • 634
dan
  • 43,914
  • 47
  • 153
  • 254
  • Hmm, that's a tough one - you'd need to have the database automatically decrypted *under certain conditions only*? – Piskvor left the building Feb 10 '11 at 19:25
  • 1
    @Piskvor, No decrypting the entire database would sort of be defeating the purpose. What is needed is a database of encrypted information. Like a docs.google.com that stores all our documents encrypted, yet still searchable. – Pacerier Jun 09 '14 at 22:49

8 Answers8

13

Update: You'll want to check out CipherSweet instead of rolling your own design. It takes care of a lot of subtle security details and has a straightforward security argument.


Hash functions aren't the solution here. As the accepted answer suggests, indexing encrypted data requires a "blind index", facilitated by a MAC.

Let's say you're encrypting social security numbers. When you insert them into the database, you might do something like this:

$ssn_encrypted = \Defuse\Crypto\Crypto::encrypt($ssn, $our_encryption_key);
$ssn_blind_idx = \hash_hmac('sha512', $ssn, $our_search_key);

And then store both values in the database. When you need to quickly grab a value based on an SSN input, you can recalculate the HMAC and search based on that.

The database never sees the SSN, and your encryption keys should never be checked into source control (SVN, git, etc.).

Scott Arciszewski
  • 33,610
  • 16
  • 89
  • 206
  • This made more sense to me when I realised that `$our_search_key` is actually an encryption key, I.e. `$our_search_encryption_key` – Chris Snow Apr 03 '16 at 19:48
  • It's a cryptographic key, but not an encryption key. It's just meant for authentication to make the index truly blind to anyone that doesn't possess it. – Scott Arciszewski Apr 04 '16 at 06:29
  • @ScottArciszewski what about partial string match? When you need to match part of string, which is encrypted? In this case, DB column will store hash of full string, but how to match that row, with only part of search string? ..'WHERE blind_idx LIKE %search_string%' won't work here.... – Andrew Aug 16 '16 at 15:50
  • Partial string searches are not compatible with encrypted field storage, given known-to-be-secure encryption modes. – Scott Arciszewski Aug 17 '16 at 15:15
  • @ScottArciszewski Could we store hashes of ngrams? (I'm not sure how having hashes of many ngrams could affect security.) – Raphael Dec 13 '16 at 13:34
9

I'm currently looking for a solution to this same problem.

One of the best ideas I've found, is this article from Raul García, https://learn.microsoft.com/en-us/archive/blogs/raulga/indexing-encrypted-data.

He suggests using a MAC, to create an indexable column. The solution is for MS SQL Server, but it could be applied to another system.

erangel
  • 108
  • 5
4

You need to use a new class of encryption algorithms called Format Preserving Encryption (search Wiki).

I would be judicious in using such algorithms off-hand simply for the reason that they are relatively new to the literature and it is a thumb rule that you wait for an algorithm to be crypt-analyzed for (say) a decade before you can use it for serious purposes. I am also not sure if there are any standards for such encryption formats. There is only a draft for standard that was submitted in 2010. http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ffx/ffx-spec.pdf

So, consider using it judiciously. Do not rely on format-preserving encryption for information that needs a secrecy span of more than (say) 5 years.

Kapali
  • 49
  • 1
  • 2
  • "Format Preserving Encryption" is not a solution for this. From wiki, "format-preserving encryption (FPE), refers to encrypting in such a way that the output (the ciphertext) is in the same format as the input (the plaintext). The meaning of "format" varies. Typically only finite sets of characters are used; numeric, alphabetic or alphanumeric. For example: Encrypting a 16-digit credit card number so that the ciphertext is another 16-digit number. Encrypting an English word so that the ciphertext is another English word.".it is mainly used when introducing encryption to legacy apps. – Manjula May 04 '22 at 06:57
2

The reality is you will not benefit from indexes if you encrypt the data. You need to accept this.

If an index is needed, then protect the data by removing permissions to those columns on the DBA accounts. Only the application account will be able to query these columns. The security is in the limited access rather than encryption.

You have to accept trade-offs. I hope someone comes in with a wiz bang answer that proves me wrong!

Lord Tydus
  • 544
  • 2
  • 7
  • 1
    +1. Erangel's link is excellent. However keep in mind: that is an MSDN **blog article** exploring a possible idea, and Dan wants solid tools or techniques for production use, with no risk of downtime or data loss. – JasonSmith Feb 15 '11 at 04:47
  • A note, well, more of a plug, but our UniVerse database does support encrypted columns, tables, primary keys and even indexes ("which means,among other things, that your queries can still perform effective equal, ordered, or range comparisons based on indexes.") – Dan McGrath Oct 12 '12 at 22:22
  • @DanMcGrath, How did they do it? What's the process used? – Pacerier Jun 09 '14 at 23:08
  • I think they have a patent on it somewhere if you search for it. Other than that, I don't think I can talk about the implementation as it is propriety software. – Dan McGrath Jun 12 '14 at 16:08
1

Take the attributes you want to search on and run them through a 1-way hash (MD5, SHA1), store the results as individual columns and index those columns. Then when you need to query a value, run the input (unencrypted) value through the same hash and search for the hashed value.

Jim Garrison
  • 85,615
  • 20
  • 155
  • 190
  • Assuming full text search (which is the most common), simple dictionary attack renders the data "no longer encrypted". As such, this approach is actually almost no better than simply indexing the plaintext. – Pacerier Nov 01 '14 at 15:05
  • The question has been edited significantly since I answered. There was no mention of full-text search at that time. – Jim Garrison Nov 01 '14 at 22:37
  • Even if it's *half* text search, then half of that document is leaked. Of course the search term has to be based on the contents of the document, which means we are leaking information regarding the document. – Pacerier Nov 02 '14 at 19:07
  • Equality search based on one or more specific columns that are encrypted is easy, as described in my answer. Anything else, such as range based or full-text search of an encrypted field is a totally different problem. – Jim Garrison Nov 04 '14 at 04:52
  • I'm saying that the one or more specific attributes you are talking about needs to be encrypted and not merely hashed, because hashing does not guarantee their **secrecy**. – Pacerier Nov 04 '14 at 15:11
1

Store the encrypted blobs but create separate indexing tables that are tied to the blobs using encrypted relations. For example, the following table could store your blobs:

blob(ID,SHA(secret-seed,data))

and the indexes could be related to the blob as such:

word(SHA(secret-seed,blob-ID),value)

Now when you query for some blob you do:

select blob join word on SHA(secret-seed,ID) = word-ID where query IN value

You could even use different seeds for the keys and actual blob data.

0

There are databases that do support encrypted indexes. The one I know (since I worked for the company) is UniVerse.

Check out the security manual(1) 'Automatic Data Encryption' section. Perhaps it will give you some ideas.

(1): http://docs.rocketsoftware.com, search for "UniVerse Security Features"

Dan McGrath
  • 41,220
  • 11
  • 99
  • 130
  • With relation to this thread, Which chapter of that pdf are you talking about? – Pacerier Nov 02 '14 at 19:05
  • Do you know of any independent evaluations of the security of encrypted indexes? Intuitively, anything other than strict-equality indexing seems like it must *inevitably* compromise security. And if it is just strict-equality indexing, I'd guess the solutions outlined in other answers on this question are superior, if for no other reason than they are specific and concrete. – Kenny Evitt Dec 03 '19 at 05:17
  • The Rocket Software docs site is pretty unfriendly – no nice links to specific documents for others to share for one. I found a UniVerse Security Features PDF for one version and on page 234 of that 401 page document, in the section "Encrypting indexes", it seems like the index data itself ("index file") is encrypted, but only at rest as "UniVerse also ensures that encrypted indexes can be used the same way as clear text indexes, which means, among other things, that your queries can still perform effective equal, ordered, or range comparisons based on indexes.". – Kenny Evitt Dec 03 '19 at 05:27
  • This note from page 235 also seems to imply that UniVerse 'encrypted indexes' aren't *indexes of encrypted data*: "Note: You can encrypt a virtual field index, but you cannot encrypt the virtual field itself. If any data field in a file is encrypted, you should analyze all virtual field indexes and directly encrypt any index that involves the encrypted data field. Otherwise, you run the risk of exposing the encrypted data in clear text through an unencrypted virtual field index file." – Kenny Evitt Dec 03 '19 at 05:29
  • @KennyEvitt I haven't worked at Rocket for several years, so I'd suggest messaging them for some of your questions here. – Dan McGrath Dec 03 '19 at 17:43
  • " 'encrypted indexes' aren't indexes of encrypted data" -> I think you've misunderstood virtual indexes somewhat. Virtual indexes enable you to create multiple views on data, and you get to determine which views are encrypted or not. For example, a Virtual index might create an age bucket from an age field (e.g 0-10, 11-25, etc) which you don't want encrypted, but actual age you do. – Dan McGrath Dec 03 '19 at 18:04
  • First, thanks for the answer! I don't think it's of much value now and based on the security document I read, the UniVerse encrypted indexes don't seem like a solution for this question, but your suggestion to look into it was and is reasonable. I'm not personally interested in this question – I was actually helping cleanup a related question on another SE site and found this question and your answer. Given the other answers I'd read, I was curious if Rocket really had something unique for this, but I don't think they do. – Kenny Evitt Dec 03 '19 at 22:39
  • Based on what I've read, and my own (modest) cryptographic intuition, there's no way to create a *searchable* index of *encrypted* data without compromising the security of the encryption. The UniVerse document, or rather the portion I read, makes me think that UniVerse ultimately works with un-encrypted data for things like partial string matches. That's totally fine, and probably best overall, but the same thing is possible in other DB systems. The encrypted indexes seem like encryption for the entire index data structure, on disk, and not some way to search data without first decrypting it. – Kenny Evitt Dec 03 '19 at 22:42
  • I might have misunderstood virtual indexes in UniVerse, but it doesn't possible to create an *unencrypted* view of data that's encrypted – without decrypting that data. And if the index persists, even just in the DB server's process memory, and the data in the index is not encrypted, it seems slightly more precise to describe those indexes as encrypted indexes versus indexes of encrypted data. It also seems like storing an age bucket unencrypted would inevitably compromise any encryption of the actual age values. – Kenny Evitt Dec 03 '19 at 22:46
  • I did describe it as encrypted indexes (and almost all DBs I know do it this way). – Dan McGrath Dec 04 '19 at 18:03
0

The main problem in your scenario is that encryption and availability for indexing / search are contradictory parameters.

Here's the artificial but simple example of the problem: Imagine we are looking for "child porn" in business e-mail. The DB is encrypted, everything is fine. But if the search reveals that the e-mail from John to Bill contains both of these words by finding this e-mail when searching for "child porn", then the actual contents don't matter anymore - child porn should not be discussed in e-mail at all.

So if the DB leaks together with indexes, smart analysis of the word set can reveal plenty of information. For example finding that 50% of corporate mail of software vendor company includes "webos" term can reveal the [possibly secret] fact, that the company works on software for webos.

Now you see, that encryption has limited usefulness in your case. Stronger overall security of the DB might be more important than encryption.

Eugene Mayevski 'Callback
  • 45,135
  • 8
  • 71
  • 121
  • I don't think one would have an encrypted database be searchable with unencrypted text. The intention would be that anyone wanting to do a search would need to know the encryption and key. The difficulty is that if one wants to allow an efficient search for e.g. entries starting with "chil", then the database must have some means of distinguishing entries that start with "chil" from those which start with other characters without having to decrypt all the entries. That can only really be done by having the indices themselves encrypted; a DB engine can do that more more easily than the client. – supercat Jun 09 '14 at 23:12
  • @Eugene, Your point is a non-argument. Obviously a search without the **key** is supposed to give you 0 results. The search input is no longer simply the text input itself, but the duet of the key and the text input. – Pacerier Nov 01 '14 at 15:10