I'm implementing a variation of basic Lempel-Ziv compression in Python (2.7). Normally this algorithm outputs a list made up of characters and integers, where each integer is the index of a new string in the dictionary.
Now imagine we compress a file large enough that integers up to 400000 or more appear. What I'm doing is converting each of those integers to binary and splitting the binary string into chunks of up to 8 bits (the binary form of 400000, for example, is a string of 19 ones and zeros, so it splits into two 8-bit bytes plus a 3-bit remainder). That way each 6-digit integer is reduced to a 3-character string. Note that even 3-digit integers are reduced to 2-character strings, so the list produced by the LZW algorithm becomes more compact.
The problem is that I seem to be able to compress a file correctly with this code (from 2.2 MB to 1.5 MB), or so I think, but when I decompress it I don't get exactly the original text back.
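To make the splitting step concrete, here is a minimal standalone sketch (not my actual code; the helper name split_bits is made up for this example). It only shows how the binary form of an integer is cut into chunks of at most 8 bits, left to right, assuming the chunks are taken from the most significant bit onwards:

```python
def split_bits(n, width=8):
    """Cut the binary form of n into chunks of at most `width` bits, left to right."""
    bits = bin(n)[2:]  # drop the '0b' prefix
    return [bits[i:i + width] for i in range(0, len(bits), width)]


chunks = split_bits(400000)
# chunks is ['11000011', '01010000', '000']: two full bytes and a 3-bit remainder
values = [int(chunk, 2) for chunk in chunks]
# values is [195, 80, 0]: one character code per chunk
```

So 400000 becomes three characters, and a 3-digit integer such as 300 (9 bits) becomes two.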
Here is my compression code:
def encode(order):
    danger = [0, 9, 10, 13, 32, 222, 255, 256]
    str2 = ""
    str3 = ""
    binary = bin(order)[2:]
    for bit in binary:
        str2 += bit
        if len(str2) == 8:
            helper = int(str2,2)
            if helper in danger:
                str3 = chr(222)+str(order) #222 is choosable, may be another ASCII one
                str2 = ""
                break
            else:
                str3 += chr(int(str2,2))
                str2 = ""
    if str2 != "":
        helper = int(str2,2)
        if helper in danger:
            str3 = chr(222)+str(order)
        else:
            str3 += chr(int(str2,2))
    return str3
file_in = open("donquijote.txt")
file_out = open("compressed5.txt","w")
codes = dict([(chr(x), x) for x in range(256)])
danger = [0, 9, 10, 13, 32, 222, 255, 256]
code_count = 257
current_string = ""
string = file_in.read()
for c in string:
    current_string = current_string + c
    if not current_string in codes:
        codes[current_string] = code_count
        if (codes[current_string[:-1]] < 257) & (codes[current_string[:-1]] not in danger):
            file_out.write(chr(codes[current_string[:-1]])+" ")
        else:
            str4 = encode(codes[current_string[:-1]])
            file_out.write(str4+" ")
        code_count += 1
        current_string = c
file_out.write(encode(codes[current_string]))
file_in.close()
file_out.close()
Okay, so the tricky part in all this is that I'm writing the compressed codes to a file, and to preserve the "list" format I separate each element of the list with a blank space, which saves me the commas (traditional lists look like ['A', 'B', 'C', ...]). Because of that, I've defined a list, danger, which contains the problematic character codes that could break this "phantom list" format, such as spaces, nulls, tabs, etc. When one of those appears, I keep the integer reference to the dictionary as plain decimal digits and put a marker character in front of it (I've chosen the character with code 222, though it could be another one), which is itself included in danger. That way, during decompression, when this marker appears the code knows that the sequence after it has to be used directly as a dictionary reference, rather than being converted to binary and reassembled.
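The marker idea on its own can be sketched like this (again a simplified standalone illustration, not my actual code; needs_escape, escape and unescape are names invented for this example, and the chunks are assumed to be bit strings such as '11000011'):

```python
DANGER = [0, 9, 10, 13, 32, 222, 255, 256]  # character codes that must not be written raw
MARKER = chr(222)  # arbitrary choice; any character listed in DANGER would do


def needs_escape(chunks):
    """True if any chunk of bits maps to a problematic character code."""
    return any(int(chunk, 2) in DANGER for chunk in chunks)


def escape(order):
    """Write the dictionary index as the marker followed by its decimal digits."""
    return MARKER + str(order)


def unescape(token):
    """Read back a dictionary index from a token that starts with the marker."""
    return int(token[1:])


token = escape(400000)        # marker followed by the digits '400000'
restored = unescape(token)    # the original index, 400000
```

For example, the chunks of 400000 end in '000', which is code 0 (a null), so that integer would be written in the marker form instead of as three raw characters.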
Here is my decompression code:
output = open("compressed5.txt")
descomp = open("decompressed5.txt","w")
text = output.read()
compressed_data = text.split()
strings = dict([(x, chr(x)) for x in range(256)])
next_code = 257
previous_string = ""
binary = ""
a = 1
for element in compressed_data:
    for char in element:
        if ord(char) == 222:
            c = int(element[1:])
            break
        else:
            binary += bin(ord(char))[2:]
            if a == len(element):
                c = int(binary,2)
                a = 1
            else:
                a += 1
    binary = ""
    if not (strings.has_key(c)):
        strings[c] = previous_string + (previous_string[0])
    descomp.write(strings[c])
    if not(len(previous_string) == 0):
        strings[next_code] = previous_string + (strings[c][0])
        next_code +=1
    previous_string = strings[c]
output.close()
descomp.close()
I can't see what I'm missing here (I'm quite a novice in Python, actually), or whether I should add another problematic character to the danger list to avoid some kind of conflict with the "list" formatting. Or is there another way to write this list to the output file in a compact form without losing its format?
Any kind of help is greatly appreciated!!