Deterministic and non uniform long string generation from seed

Question

I had this weird idea for an encryption scheme that I wanted to try out. It may be bad, and it may have been done before, but I'm just doing it for fun. The short version of the question is: is it possible to generate a long, deterministic, non-uniformly distributed string/sequence of numbers from a small seed?

Long(er) version: I was thinking of encrypting a text by changing its encoding. The new encoding would be generated via the Huffman algorithm. To work well, the Huffman algorithm needs a fairly long text with a non-uniform distribution. Characters then get different bit lengths, which would be the primary strength of this encryption. The problem is that it's impractical to enter or remember a long text each time you want to decrypt the text. So I was wondering whether it is possible to generate the text from a password seed.

It doesn't matter what the text is, as long as it has a non-uniform distribution of characters and the exact same sequence can be recreated each time you give it the same seed. Preferably, are there any functions/extensions in Python that can do this?
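For readers who want to experiment with the idea, here is a minimal sketch of how a Huffman code table could be derived from any sufficiently skewed text. The function name `huffman_table` and the sample text are assumptions made for illustration, not part of the original question, and the exact bit patterns depend on how ties are broken.

```python
import heapq
from collections import Counter

def huffman_table(text):
    """Return {char: bitstring} built from the character counts in `text`."""
    counts = Counter(text)
    if len(counts) < 2:
        # Degenerate input: no symbols, or a single symbol that gets code "0".
        return {ch: "0" for ch in counts}
    # Heap entries are (weight, tie_breaker, {char: partial_code}); the unique
    # tie_breaker guarantees that the dicts themselves are never compared.
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

table = huffman_table("aaaaaaaabbbbccd")
for ch, code in sorted(table.items()):
    print(ch, code)
```

Frequent characters end up with short codes and rare ones with long codes, which is why a uniform source text would give every character roughly the same code length.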

EDIT: To expand on the "strength" of varying bit length: if I have a string "test", ASCII values 116, 101, 115, 116, which gives bit values of 1110100 1100101 1110011 1110100

Then, say my Huffman algorithm generates encoding like t = 101 e = 1100111 s = 10001

The final string is 101 1100111 10001 101. If we regroup these bits into 7-bit ASCII (padding the last group with zeros), we get 1011100 1111000 1101000, which is 3 entirely different characters: "\xh". Obviously it's impossible to perform any kind of frequency analysis or something like that on this.
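The regrouping in this example can be reproduced with a short sketch. The code table is the hypothetical one from the example (it was not produced by an actual Huffman run), and the helper name `reencode` as well as the zero padding of the last group are assumptions.

```python
# Hypothetical code table copied from the example above.
code = {"t": "101", "e": "1100111", "s": "10001"}

def reencode(text, code, width=7):
    """Concatenate the variable-length codes, then regroup into fixed-width characters."""
    bits = "".join(code[ch] for ch in text)
    # Pad with zeros on the right so the length is a multiple of `width`.
    bits += "0" * (-len(bits) % width)
    groups = [bits[i:i + width] for i in range(0, len(bits), width)]
    return groups, "".join(chr(int(g, 2)) for g in groups)

groups, ciphertext = reencode("test", code)
print(groups)      # ['1011100', '1111000', '1101000']
print(ciphertext)  # \xh
```

Note that the 18 code bits need 3 padding bits to fill the last 7-bit group, so a decoder would also have to know where the real data ends.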

+1 for the nice idea and the good description of the approach. I'm curious, why do you think that encrypting using "different bit-lengths" for the characters will be a strength of this encryption? Do you think that crackers only can work on fixed bit-lengths like, say, 8? — Alfe, Oct 23 '13 at 09:40
The way I thought is like this: if I have a string "test", ASCII values 116, 101, 115, 116, that gives bit values of 1110100 1100101 1110011 1110100. Then, say my Huffman algorithm generates an encoding like t = 101 e = 1100111 s = 10001. The final string is 101 1100111 10001 101; if we encode this back to ASCII, we get 1011100 1111000 1101000, which is 3 entirely different characters. Obviously it's impossible to perform any kind of frequency analysis or something like that on this. — Limon, Oct 23 '13 at 09:48
Frequency analysis can also be done on 7-bit lengths, 6-bit lengths, etc. and I'd be surprised if crackers wouldn't do that on a regular basis. Hiding information would be quite easy otherwise ;-) Building your own code-table (as in using LZW) would also complicate things, but I'm sure the state-of-the-art cracker knowledge encompasses all of this. — Alfe, Oct 23 '13 at 10:08
But how can it be done if they all have varying bit lengths? Every sequence of x bits in the encoded version can consist of 1 to x encoded characters. For example, if s = 1 and a = 01, then 111111 = "ssssss" and 010101 = "aaa". Here two sequences of 6 bits got converted into two strings of varying size. Not to mention that some characters can overlap the boundaries: if t = 0011, then 001100 11xxxx is "tt*" where * is some other character — Limon, Oct 23 '13 at 10:19
I'm sure that decent crackers do _not_ rely on byte boundaries or even fixed-width-characters when they try to crack a cypher. But this gets a little out of scope to discuss here. Anyway, keep in mind that state-of-the-art encryption algorithms typically don't use 3, 6 or 9 bit characters, but group together hundreds of characters to have blocks of thousands of bits. Such a block is not likely to be vulnerable to statistical attacks, but any small number of bits per block is a risk. — Alfe, Oct 23 '13 at 10:47
Oh yeah, I'm certain that a decent cracker will have no problem with this, I just didn't think it would be vulnerable to simple statistical methods. I have no experience with encryption, but I think this idea isn't too bad given its simplicity. Thank you for the help! — Limon, Oct 23 '13 at 11:04

DhruvPathak · Answer 1 · 2013-10-23T10:01:45.980

3

This is a solution based on the random module, which generates the same sequence whenever it is given the same seed.

import random
from string import ascii_lowercase
from collections import Counter

seed_value = 3334
string_length = 50
random.seed(seed_value)  # same seed -> same sequence on every run

# Give each letter a random integer weight between 1 and 10.
seq = [(x, random.randint(1, 10)) for x in ascii_lowercase]

def weighted_choice(s):
    # Repeat each value `wt` times, then pick uniformly from the expanded list.
    return random.choice(sum(([v] * wt for v, wt in s), []))

random_list = [weighted_choice(seq) for _ in range(string_length)]
print("".join(random_list))
print("Test non uniform distribution...")
print(Counter(random_list))
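As a possible variation on the weighted selection above, `random.choices` (available since Python 3.6) accepts weights directly, and a private `random.Random` instance keeps the sequence independent of other users of the global generator. This is only a sketch: the function name `weighted_string` is an assumption, and its output will not match the output of the code above, because the underlying random calls differ.

```python
import random
from collections import Counter
from string import ascii_lowercase

def weighted_string(seed, length=50):
    """Build a reproducible, non-uniformly distributed lowercase string."""
    rng = random.Random(seed)  # private generator, unaffected by random.seed()
    # One random integer weight (1..10) per letter, drawn from the same generator.
    weights = [rng.randint(1, 10) for _ in ascii_lowercase]
    return "".join(rng.choices(ascii_lowercase, weights=weights, k=length))

s1 = weighted_string(3334)
s2 = weighted_string(3334)
print(s1 == s2)                    # True: same seed, same string
print(Counter(s1).most_common(5))  # the most frequent letters
```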

edited Oct 23 '13 at 10:01

answered Oct 23 '13 at 09:33

DhruvPathak


Correct me if I'm wrong, but won't random generate a uniform distribution? The Huffman algorithm will then only "shuffle" the letters instead of actually giving them different lengths. – Limon Oct 23 '13 at 09:40

@Limonup, you were correct. I have added weighted random selection logic, which creates a non-uniform distribution. – DhruvPathak Oct 23 '13 at 10:02

Alfe · Accepted Answer · 2013-10-24T08:28:30.563

Based on DhruvPathak's straightforward answer, which creates a simple random string of characters, I have two additions: ① a non-uniform distribution and ② a random translation to prevent prediction of the frequency of the letters:

import random

translation = list(range(26))  # a list, so that shuffle can reorder it in place
random.shuffle(translation)  # ②
random_string = ''.join(chr(
  translation[random.randint(0, random.randint(1, 25))] + ord('a'))  # ①
  for _dummy in range(1000))

The non-uniform distribution is achieved by using randint(randint(…)): the inner call picks a random upper bound and the outer call picks an index up to that bound, which basically prefers the lower numbers as output.
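To see how strongly the nested call favours low indices, the exact probabilities can be computed by enumerating both draws. This is a sketch for illustration; the helper name `index_probabilities` is an assumption.

```python
from fractions import Fraction

def index_probabilities(max_upper=25):
    """Exact P(index == k) for random.randint(0, random.randint(1, max_upper))."""
    probs = [Fraction(0)] * (max_upper + 1)
    for upper in range(1, max_upper + 1):   # inner randint: uniform on 1..max_upper
        for k in range(upper + 1):          # outer randint: uniform on 0..upper
            probs[k] += Fraction(1, max_upper) * Fraction(1, upper + 1)
    return probs

probs = index_probabilities()
print(float(probs[0]), float(probs[25]))
```

Indices 0 and 1 are equally likely (about 11% each), and the probability then drops steadily to 1/650 for index 25, so the first entries of the translation list dominate the generated text.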

In a first try I got this translation list:

[5, 18, 22, 16, 3, 20, 2, 4, 19, 24, 9, 21, 12, 15, 7, 0, 25, 11, 14, 17, 10, 8, 13, 6, 1, 23]

And a count of the characters in the resulting random_string (done by `f = [0] * 26`, then `for c in random_string: f[ord(c) - ord('a')] += 1`, then `list(zip(*sorted(zip(f, range(26)), reverse=True)))[1]`), ordered from most to least frequent, gave this list:

(18, 5, 22, 16, 3, 20, 2, 4, 19, 24, 12, 21, 15, 9, 0, 7, 25, 14, 17, 10, 11, 13, 8, 1, 23, 6)

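So, the outcome matches the expectation pretty well: the frequency ranking follows the order of the translation list closely, with only a few neighbouring entries swapped.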
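Putting the pieces of this answer together, a password could seed a private generator so that the same password always yields the same text. This is a sketch under assumptions: the function name `keyed_text` and the sample password are made up, and the `random` module is documented as unsuitable for security purposes, so this is for experimentation only.

```python
import random
from collections import Counter

def keyed_text(password, length=1000):
    """Reproducible, non-uniformly distributed lowercase text derived from `password`."""
    rng = random.Random(password)      # a str is accepted as a seed
    translation = list(range(26))
    rng.shuffle(translation)           # ② random letter order
    return "".join(
        chr(translation[rng.randint(0, rng.randint(1, 25))] + ord("a"))  # ① skewed index
        for _ in range(length)
    )

text = keyed_text("correct horse")
print(text == keyed_text("correct horse"))  # True: fully deterministic
print(Counter(text).most_common(3))         # the three most frequent letters
```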

Ok, this is pretty good. But won't the translation table make it non-deterministic? EDIT: Unless the seed also affects shuffle? — Limon, Oct 23 '13 at 09:57
Seeding affects the whole random generator, so everything stays perfectly deterministic. If you need several instances, you can also create random-generator instances in that module and use them (instead of the convenience functions like `random.shuffle` etc.). — Alfe, Oct 23 '13 at 09:59
Yes, just tested it myself. Works just the way I wanted it. Thank you! — Limon, Oct 23 '13 at 10:11
