-2

Semantically secure algorithms can create many different ciphertext values for any single plaintext, mostly through the use of a salt or initialization vector. However, that is an obvious issue for detecting duplicate values in a column, as there won't be any!

For example, when I use AES-256-CBC to store a base64-encoded value in <iv>$<ciphertext> format into a table, plaintext "SMITH, JOHN" may be stored as

fnhY58/hCd78VjY51YL8QA==$B9W1bEUWkK4gqyWlzJULMA==
9d9p9/L3Kla9oA/pOUFo1g==$5notB0QSZaGJ+/M5dpTJ/Q==
Y3NETmg53010roB6iMBMVA==$55Ap+vKHYqLlUDNzAhAwZg==
...

If that column is marked as a UNIQUE index or a PRIMARY KEY, duplicate detection and suppression (e.g. INSERT ON DUPLICATE KEY UPDATE) will silently fail, as the ciphertexts will obviously be different.

(I guess this may be the reason why MariaDB's built-in AES functions use ECB mode, and MySQL's default to it unless you tell them otherwise...)

Is there any practical way of using semantically secure encryption while preserving duplicate detection?

I guess one workaround would be to create an additional column with a securely derived hash of the plaintext using a KDF (using identical salt and iterations parameters), and using that for duplicate detection. But does that not defeat the point of semantic security by allowing the attacker to recognize related ciphertexts?

I imagine this use case is not uncommon, so I'm guessing there is an established workaround that I'm just not aware of. Thanks in advance for the advice!

Branko B
  • 1
  • 4
  • 1
    Add a column which produces a hash (which is non-reversable and provides the probability of the collision which is small enough, can be fixed-length binary string) as table PK. According ciphertext value should be stored in another (maybe indexed) column. – Akina Jan 11 '23 at 19:05
  • This is not an answer and is geared more towards PHP, but [this blog post](https://paragonie.com/blog/2017/05/building-searchable-encrypted-databases-with-php-and-sql) may nevertheless be informative and give you some ideas – President James K. Polk Jan 11 '23 at 21:14
  • @PresidentJamesK.Polk - That is very useful information! It is along the lines of what I was thinking of, but it's good to have it spelled out. Thanks for your help! – Branko B Jan 15 '23 at 03:36

1 Answers1

0

too long for a comment...

SEM

"Semantically secure encryption" (SEM) means, that when knowing the encrypted data only the length of decrypted data will be available.

For block cipher modes like AES-CBC or AES_ECB the original length can be only detected by removing padding bytes after decryption. In your example (IV/CT) we only know that decrypted data has a size between 0 and 15 bytes.

Example (usign AES_ECB):

select length(aes_encrypt("1", "secret128bit-key")) as length union all 
select length(aes_encrypt("SMITH, JON", "secret128-bit-key"));
+--------+
| length |
+--------+
|     16 |
|     16 |
+--------+

Initialization vector

Using different IV (initialization vectors) doesn't make much sense, unless you use AEAD (where IV can be authenticated). An IV is not considered to be secret, therefore think about only to use one IV - MariaDB for example uses one IV for binary log encryption.

Unique data

The server can determine if a encrypted value is unique only under the following conditions:

  • he will be able to decrypt the data for comparison (which implies that server knows key and iv)
  • iv and key are constant
Georg Richter
  • 5,970
  • 2
  • 9
  • 15
  • My definition of "semantically secure" may be too broad (or yours too narrow). I do not believe your solution would be cryptographically secure. Most sources I consulted strongly discourage the use of a fixed IV, as it opens encryption to a variety of attacks. – Branko B Jan 15 '23 at 03:41