
One of the answers to this question does a good job of explaining how Apache Lucene works, in particular the response by Tom Taylor. Here is Tom's response:

Lucene creates an inverted index, something like:

```
File 1:

  Term      : Random
  Frequency : 1
  Position  : 0

  Term      : Memory
  Frequency : 2
  Position  : 3
  Position  : 6
```

So it is able to search and retrieve the matching content quickly. When there are too many matches for the search query, it ranks the results by weight. Consider the search query "Main Memory": it searches for both words individually, and the result would be like:

```
Main

  File 1 : Frequency - 1

Memory

  File 1 : Frequency - 2
  File 2 : Frequency - 1
```

The result would be File 1 followed by File 2.
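The inverted-index-and-ranking scheme described above can be sketched in a few lines of Python. This is illustrative only; Lucene's actual index format and scoring (BM25) are far more sophisticated, and the document texts below are made up to reproduce the frequencies in the example:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]}, like the listing above."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Score each document by its total frequency across the query terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, positions in index.get(term, {}).items():
            scores[doc_id] += len(positions)
    # Higher total frequency ranks first
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "File 1": "Main memory is random access memory",  # Main x1, Memory x2
    "File 2": "Disk is slower than memory",           # Memory x1
}
index = build_index(docs)
print(search(index, "Main Memory"))  # → ['File 1', 'File 2']
```

Note that the index is built from plaintext tokens: both the documents and the query pass through the same tokenization, which is exactly what encryption disrupts.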

My question: Will the above still work if I decide to encrypt "Random" and "Memory" into ciphertext? By "still work", I mean: will the search results still be File 1 and File 2 if I search for the ciphertext of "Main" and "Memory"?

In essence, I am asking if it is possible to encrypt the entire Lucene index and use it to perform searches on encrypted queries.

user1068636
  • No, not unless you are using the type of encryption algorithm where each unencrypted letter is always encrypted to the same target letter. For example, using [ROT13](https://en.wikipedia.org/wiki/ROT13), the letter `c` is always encrypted to the letter `p` - and therefore you can reliably search for the text `png` knowing that you are actually searching for the word `cat`. But that is an incredibly weak form of encryption. Any decent encryption algorithm would not provide such a guarantee. – andrewJames Dec 26 '20 at 21:29
  • Also, the spaces separating words are typically lost when data is encrypted, so you cannot tell where one word ends and the next word begins. This affects Lucene's ability to create the necessary word tokens for indexing and therefore for searching. – andrewJames Dec 26 '20 at 21:30
  • You can try this for yourself, too. – andrewJames Dec 26 '20 at 21:34
  • One more thought: It's possible that a custom tokenizer - one which encrypts each separate token (i.e. each separate word), in the analyzer - might be a way forward. I am not sure how practical that would be. – andrewJames Dec 26 '20 at 23:00
  • @andrewjames your last comment is particularly interesting (encrypting tokens), but as you mentioned earlier, even if the encryption was implemented on a per-token basis, to work it would have to be encryption that always produces the same ciphertext for the same plaintext. Such a cipher leaks a lot of information about the plaintext and would probably be easily broken just based on term-frequency analysis. Encrypted search is an interesting problem, but I think a hard one. – RonC Dec 28 '20 at 15:03
  • @andrewjames after thinking about this more, a better approach would be to use a cryptographically strong hash of the term rather than encryption that consistently encrypts the same term text to the same ciphertext. Using a cryptographically strong hash would be secure, whereas the other approach isn't. – RonC Jan 11 '21 at 17:07
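The hashed-token idea from the comments could be sketched as follows. This is an assumption-laden illustration, not Lucene code: it uses a keyed hash (HMAC-SHA256) so that the same plaintext token always maps to the same opaque indexed token, and the key name and functions here are hypothetical:

```python
import hashlib
import hmac

SECRET_KEY = b"example-key"  # hypothetical key; would be kept secret in practice

def hash_token(token):
    """Deterministically map a token to an opaque hex digest.
    The same plaintext always yields the same digest, so exact-match
    search still works, but the plaintext cannot be recovered without the key."""
    return hmac.new(SECRET_KEY, token.lower().encode(), hashlib.sha256).hexdigest()

def analyze(text):
    """Sketch of an 'encrypting tokenizer': split into words, then hash each one.
    Both documents at index time and queries at search time must pass
    through this same analyzer for the hashed terms to match."""
    return [hash_token(t) for t in text.split()]

# A query term and the corresponding indexed document term agree:
assert analyze("Main Memory")[1] == hash_token("memory")
```

Note the trade-offs raised in the comments still apply: term frequencies and co-occurrence patterns leak even with a keyed hash, and features that depend on the plaintext form of terms (phrase proximity, wildcard, fuzzy, and range queries) no longer work over hashed tokens.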
