12

I need to obfuscate lines of Unicode text to slow down those who may want to extract them. Ideally this would be done with a built-in Python module or a small add-on library; the obfuscated string should be the same length as the original or shorter; and the "unobfuscation" should be as fast as possible.

I have tried various character swaps and XOR routines, but they are slow. Base64 and hex encoding increase the size considerably. To date the most efficient method I've found is compressing with zlib at the lowest setting (1). Is there a better way?
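
For reference, the zlib baseline described above might look roughly like the sketch below. It uses Python 3 syntax, and the names `obfuscate` and `deobfuscate` are illustrative only, not taken from any library.

```python
import zlib

def obfuscate(text: str) -> bytes:
    # Encode to UTF-8 first; zlib works on bytes. Level 1 is the fastest setting.
    return zlib.compress(text.encode("utf-8"), 1)

def deobfuscate(blob: bytes) -> str:
    # Reverse the two steps: decompress, then decode back to text.
    return zlib.decompress(blob).decode("utf-8")

line = "sensei = せんせい " * 20
blob = obfuscate(line)
assert deobfuscate(blob) == line
print(len(line.encode("utf-8")), len(blob))
```

Note that zlib adds a few bytes of header and checksum, so very short strings can come out slightly longer than the input.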

Tim
  • 2
    Use a good, proven, widely-used encryption scheme. Everything else is broken as soon as anyone competent gets an idea of what you're doing. Yes, it will take some time, but that's the price you *have* to pay if you want anything remotely decent. If the data isn't even important enough for that, just save yourself the hassle and send it as plain text. –  Sep 20 '11 at 17:17
  • You said "slow down", not "prevent". Are you really trying to prevent people from reading the text? Under what conditions does the original text need to be read? – wberry Sep 20 '11 at 17:47
  • Speed is more important to me than security. Encryption slows the access of the data significantly. If there were an encryption scheme that did not cause a significant bottleneck, a lot of developers' issues would be solved. Is there not room for a middle ground with this issue? Sure, someone with the knowledge and some spare time could get the data, but is it worth their effort? – Tim Sep 20 '11 at 17:55
  • 2
    This issue is similar to the quality of locks on one's home. A bank vault door would be the most secure, but how many have that? Most realize it is nearly impossible to keep a determined intruder out, but they still have relatively weak locks on their doors to keep the less determined out. That is all I want with this obfuscation. I just don't want to leave the doors to my house wide open. – Tim Sep 20 '11 at 18:04
  • 1
    @Tim: "quality of locks" reasoning is faulty. Software is not the same as a crowbar for opening a mechanical lock. Once the algorithm is known the "slow down" effect immediately drops to zero. Bank vaults with known combinations are as useless as screen doors with simple hooks to keep them closed in the wind. Same with this "obfuscation". Once the obfuscation is known, the slow-down drops immediately to zero. – S.Lott Sep 20 '11 at 18:31
  • Yes, it drops to near zero when the intruder knows the algorithm and how to implement it, but that takes effort. Really, the only difference between simple obfuscation and military-grade encryption is the effort one or many want to put forth to break it. However, to the casual observer using a text editor, one looks as garbled as the other. I venture most will not take the time to mess with it. – Tim Sep 20 '11 at 19:51
  • @S.Lott, if the idea is to hide text that the program decrypts for itself, any skilled programmer who wants to make the effort can crack it no matter what algorithm you use, because the program includes all the needed information. I gather Tim simply wants to stop people who will just look at the code with a hex dump program. – Tom Zych Sep 20 '11 at 20:06
  • @Tom Zych: "the program includes all the needed information". Except, of course, the run-time-supplied key that's used. Lacking that key passphrase makes the encryption more difficult to break. – S.Lott Sep 20 '11 at 20:09
  • @S.Lott: Did anyone say the key would not be incorporated into the program? I got the impression this would be self-contained. – Tom Zych Sep 20 '11 at 20:12
  • @Tom Zych: "self-contained" and using a pass-phrase at startup are common. SSL, for example, works that way inside Apache. Hardly a limitation. – S.Lott Sep 20 '11 at 20:21
  • @Tom: You are correct. I just want the text to be scrambled for the average user so they can't easily copy/paste, extract, etc. and "borrow" the data. Speed is most crucial. I have over 31,000 lines of text in an sqlite DB and true encryption slows searching down to a crawl. zlib compression is much better but still considerably slower than plain text. – Tim Sep 20 '11 at 20:23
  • @Tim: "true encryption slows searching down to a crawl"? Why? You only encrypt as part of downloading. Why encrypt what's in the database? – S.Lott Sep 20 '11 at 20:25
  • @S.Lott: Because the DB is distributed with the app and can be easily opened with an SQLite Viewer (Firefox). Sqlite encryption (APSW) is also very slow. zlib is the fastest method I have found. Short of making a plain text index (which would be accessible via a viewer as well), I haven't found a better way. – Tim Sep 20 '11 at 21:38
  • @Tim: How big is this DB? Is it over 1GB? If not, why is it a relational database and not pickled Python structures that you decrypt once when you load them into memory? Why mess with a (slow) RDBMS that's distributed with an app? – S.Lott Sep 21 '11 at 03:21
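
The last suggestion above (keep the data as pickled Python structures and decode them once at load time) could be sketched as follows. This is only an illustration: it uses zlib as the obfuscation step because that is what the question already settled on, and `pack` and `unpack` are hypothetical helper names.

```python
import pickle
import zlib

def pack(lines):
    # Serialize the whole list once, then compress at the fastest level.
    raw = pickle.dumps(lines, protocol=pickle.HIGHEST_PROTOCOL)
    return zlib.compress(raw, 1)

def unpack(blob):
    # Done a single time at startup; afterwards everything is plain text in memory.
    return pickle.loads(zlib.decompress(blob))

lines = ["first line of text", "second line with 國碼", "third line"]
blob = pack(lines)

loaded = unpack(blob)
# Searching now runs on ordinary in-memory strings, with no per-row decoding.
hits = [s for s in loaded if "國碼" in s]
print(hits)
```

The trade-off is that the whole data set has to fit in memory and is decoded up front, rather than row by row on each search.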

4 Answers

17

How about the old ROT13 trick?

Python 3:

>>> import codecs
>>> x = 'some string'
>>> y = codecs.encode(x, 'rot13')
>>> y
'fbzr fgevat'
>>> codecs.decode(y, 'rot13')
'some string'

Python 2:

>>> x = 'some string'
>>> y = x.encode('rot13')
>>> y
'fbzr fgevat'
>>> y.decode('rot13')
u'some string'

For a unicode string (Python 2):

>>> x = u'國碼'
>>> print x
國碼
>>> y = x.encode('unicode-escape').encode('rot13')
>>> print y
\h570o\h78op
>>> print y.decode('rot13').decode('unicode-escape')
國碼
wisbucky
jterrace
  • 1
    Updated for unicode - just escape it first. The escaping adds no overhead for non-unicode characters. – jterrace Sep 20 '11 at 18:49
  • Didn't know about the unicode-escape, nice. However, it takes roughly twice as long as zlib.compress(s, 1). One would think a simple substitution would be faster, but not according to my quick tests. – Tim Sep 20 '11 at 19:07
  • One note is that `rot13` only transforms the 26 letters of the English alphabet. Numbers and punctuation will not be obfuscated. – wisbucky Oct 20 '21 at 20:24
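
To cover digits, punctuation and non-ASCII text as well, one option is a substitution over all 256 byte values using `bytes.translate`, which does the work in C. The sketch below is Python 3; the affine table (multiplier 167, offset 13) is an arbitrary choice for illustration and the function names are hypothetical. The output has the same length as the UTF-8 encoding of the input.

```python
# Build a fixed permutation of all 256 byte values.
# An odd multiplier modulo 256 guarantees every value maps to a unique value.
ENCODE_TABLE = bytes((i * 167 + 13) % 256 for i in range(256))

# Invert the permutation so decoding is another single translate() call.
_inverse = [0] * 256
for original, substituted in enumerate(ENCODE_TABLE):
    _inverse[substituted] = original
DECODE_TABLE = bytes(_inverse)

def obfuscate(text: str) -> bytes:
    # Substitute every byte of the UTF-8 encoding.
    return text.encode("utf-8").translate(ENCODE_TABLE)

def deobfuscate(blob: bytes) -> str:
    return blob.translate(DECODE_TABLE).decode("utf-8")

sample = "國碼 some string 123!"
hidden = obfuscate(sample)
assert deobfuscate(hidden) == sample
print(hidden)
```

Like ROT13, this is a plain substitution cipher and offers no real security; it only hides the text from casual inspection.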
12

This uses a simple, fast encryption scheme on bytes objects.

# For Python 3 - strings are Unicode, print is a function

def obfuscate(byt):
    # Use same function in both directions.  Input and output are bytes
    # objects.
    mask = b'keyword'
    lmask = len(mask)
    return bytes(c ^ mask[i % lmask] for i, c in enumerate(byt))

def test(s):
    data = obfuscate(s.encode())
    print(len(s), len(data), data)
    newdata = obfuscate(data).decode()
    print(newdata == s)


simple_string = 'Just plain ASCII'
unicode_string = ('sensei = \N{HIRAGANA LETTER SE}\N{HIRAGANA LETTER N}'
                  '\N{HIRAGANA LETTER SE}\N{HIRAGANA LETTER I}')

test(simple_string)
test(unicode_string)

Python 2 version:

# For Python 2

mask = 'keyword'
nmask = [ord(c) for c in mask]
lmask = len(mask)

def obfuscate(s):
    # Use same function in both directions.  Input and output are
    # Python 2 byte strings (str), e.g. UTF-8 encoded text.
    return ''.join([chr(ord(c) ^ nmask[i % lmask])
                    for i, c in enumerate(s)])

def test(s):
    data = obfuscate(s.encode('utf-8'))
    print len(s), len(data), repr(data)
    newdata = obfuscate(data).decode('utf-8')
    print newdata == s


simple_string = u'Just plain ASCII'
unicode_string = (u'sensei = \N{HIRAGANA LETTER SE}\N{HIRAGANA LETTER N}'
                  u'\N{HIRAGANA LETTER SE}\N{HIRAGANA LETTER I}')

test(simple_string)
test(unicode_string)
Tom Zych
  • Looks good, but I didn't mention I am limited to Python 2.5-2.7. I don't even have 3.x on my system to test with. What is the 2.x equivalent to the bytes module? – Tim Sep 20 '11 at 18:35
  • @Tim: Try this. I only have 2.6 on this machine and it looks like it didn't interpret the `\N{...}` stuff, so I couldn't test it as completely as I'd like to. – Tom Zych Sep 20 '11 at 19:03
  • Thanks for the 2.x version. Unfortunately, it is much slower than zlib.compress(). For a 6MB file this takes about 5.1 sec. With zlib it takes .18 sec. About 25 times faster. – Tim Sep 20 '11 at 20:13
  • Probably the difference between C code and native Python code. Go with zlib, then. – Tom Zych Sep 20 '11 at 20:14
  • 1
    One should also be careful when dealing with the obfuscated string, since it may have \x00 (string terminator character) in it. In my case, it was truncating an obfuscated password when stored in the database. – romaia Feb 01 '17 at 15:41
  • @TomZych Thanks for your code. I am using it to obfuscate sqlite rows. There are hundreds of thousands of 1k fields. It is now a bottleneck, the slowest part of the code. How can we make it faster? – 2adnielsenx xx Apr 19 '22 at 07:48
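
One way to speed up the same repeating-key XOR without leaving the standard library is to XOR the whole buffer as a single big integer, so the per-byte loop runs in C instead of in Python. This is a Python 3 sketch, not a benchmarked result; `MASK` and the function name mirror the answer above but the approach itself is a swapped-in technique.

```python
MASK = b"keyword"

def obfuscate(data: bytes, mask: bytes = MASK) -> bytes:
    # Same function in both directions, same result as the per-byte XOR loop.
    n = len(data)
    if n == 0:
        return b""
    # Repeat the mask until it covers the data, then trim to the exact length.
    repeats = -(-n // len(mask))  # ceiling division
    stream = (mask * repeats)[:n]
    # XOR both buffers as big integers; to_bytes() restores any leading zero bytes.
    mixed = int.from_bytes(data, "big") ^ int.from_bytes(stream, "big")
    return mixed.to_bytes(n, "big")

data = "sensei = せんせい".encode("utf-8") * 100
hidden = obfuscate(data)
assert obfuscate(hidden) == data
print(len(data), len(hidden))
```

As noted in the comments above, the output can contain `\x00` bytes, so it should be stored as a binary (BLOB) value rather than as text.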
2

Use codecs with hex encoding, like:

>>> import codecs
>>> codecs.encode(b'test/jimmy', 'hex')
b'746573742f6a696d6d79'
>>> codecs.decode(b'746573742f6a696d6d79', 'hex')
b'test/jimmy'
Jimmy Obonyo Abor
1

It depends on the size of your input. If it's over 1K, then using numpy is about 60x faster (it runs in less than 2% of the time of the naïve Python code).

import time
import numpy as np

mask = b'We are the knights who say "Ni"!'
mask_length = len(mask)

def mask_python(val: bytes) -> bytes:
    return bytes(c ^ mask[i % mask_length] for i, c in enumerate(val))

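def mask_numpy(val: bytes) -> bytes:
    arr = np.frombuffer(val, dtype=np.int8)
    length = len(val)  # use the argument, not the global `value`
    # Repeat the mask enough times to cover the input, then trim to length.
    repeats = -(-length // mask_length)  # ceiling division
    np_mask = np.tile(np.frombuffer(mask, dtype=np.int8), repeats)[:length]
    masked = arr ^ np_mask
    return masked.tobytes()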


value = b'0123456789'
for i in range(9):
    start_py = time.perf_counter()
    masked_py = mask_python(value)
    end_py = time.perf_counter()

    start_np = time.perf_counter()
    masked_np = mask_numpy(value)
    end_np = time.perf_counter()

    assert masked_py == masked_np
    print(f"{i+1} {len(value)} {end_py-start_py} {end_np-start_np}")
    value = value * 10

Table of results

Note: I'm a novice with numpy; if anyone has any comments on my code, I would be very happy to hear them.
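
One possible simplification, offered as a sketch rather than a measured improvement: `np.resize` repeats an array as often as needed to reach a requested size, which replaces the manual tile-and-slice step, and `np.uint8` matches the unsigned nature of raw bytes. The name `MASK_ARR` is introduced here for illustration.

```python
import numpy as np

MASK = b'We are the knights who say "Ni"!'
# Convert the mask once, outside the function, so each call skips this step.
MASK_ARR = np.frombuffer(MASK, dtype=np.uint8)

def mask_numpy(val: bytes) -> bytes:
    arr = np.frombuffer(val, dtype=np.uint8)
    # np.resize fills the new array with repeated copies of the mask.
    key = np.resize(MASK_ARR, arr.size)
    return (arr ^ key).tobytes()

value = b"0123456789" * 100
masked = mask_numpy(value)
assert mask_numpy(masked) == value
print(len(value), len(masked))
```

The XOR is its own inverse, so the same function both obfuscates and restores the data.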

Motti