
I'm writing a variation of the basic Lempel-Ziv (LZW) compression in Python 2.7. This algorithm usually outputs a list made up of characters and integers, the integers designating the position of each new string in the dictionary.

Now, imagine we compress a file large enough that integers up to 400000 or more appear. What I'm doing is converting each of those integers to binary and splitting the binary string into chunks of up to 8 bits. The binary form of 400000, for example, is a string of 19 ones and zeros, so it splits into two 8-bit bytes and a final chunk of 3 bits. That way each 6-character integer is reduced to a 3-character string. Note that even 3-digit integers are reduced to 2-character strings, so the list produced by the LZW algorithm becomes more compact.
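
The splitting described above can be illustrated with a minimal sketch. `split_bits` is a hypothetical helper, not part of the code below; it only shows how the bit string of 400000 falls into chunks, and it is written so that it should run on Python 3 as well as 2.7.

```python
def split_bits(order, width=8):
    """Split the binary digits of `order` into groups of `width` bits,
    starting from the most significant bit. The last group may be shorter."""
    bits = bin(order)[2:]  # drop the '0b' prefix
    return [bits[i:i + width] for i in range(0, len(bits), width)]

print(len(bin(400000)[2:]))  # 19
print(split_bits(400000))    # ['11000011', '01010000', '000']
```

Note that the final chunk here is `'000'`, whose integer value is 0, so a chunk can consist entirely of zeros or start with zeros.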

The code compresses a file correctly, or so I think (from 2.2 MB to 1.5 MB), but when I decompress it I don't get back exactly the original text.

Here is my compression code:

def encode(order):
    danger = [0, 9, 10, 13, 32, 222, 255, 256]
    str2 = ""
    str3 = ""
    binary = bin(order)[2:]
    for bit in binary:
        str2 += bit
        if len(str2) == 8:
            helper = int(str2,2)
            if helper in danger:
                str3 = chr(222)+str(order) #222 is choosable, may be another ASCII one
                str2 = ""
                break
            else:
                str3 += chr(int(str2,2)) 
                str2 = ""
    if str2 != "":
        helper = int(str2,2)
        if helper in danger:
            str3 = chr(222)+str(order)
        else:
            str3 += chr(int(str2,2))
    return str3

file_in = open("donquijote.txt")
file_out = open("compressed5.txt","w")

codes = dict([(chr(x), x) for x in range(256)])
danger = [0, 9, 10, 13, 32, 222, 255, 256]      
code_count = 257
current_string = ""
string = file_in.read()
for c in string:
    current_string = current_string + c
    if not current_string in codes:
        codes[current_string] = code_count
        if (codes[current_string[:-1]] < 257) & (codes[current_string[:-1]] not in danger):
            file_out.write(chr(codes[current_string[:-1]])+" ")
        else:
            str4 = encode(codes[current_string[:-1]])
            file_out.write(str4+" ")
        code_count += 1
        current_string = c
file_out.write(encode(codes[current_string]))

file_in.close()
file_out.close()

Okay, so the tricky part is this: since I'm writing the compressed codes to a file, and in order to keep the "list" format, I separate each element of the list with a blank space, which saves on commas (traditional lists look like ['A', 'B', 'C', ...]). Because of that, I've defined a list, danger, which contains the problematic characters that could break this "phantom list" format, such as spaces, nulls, tabs, etc. When one of those appears, I keep the integer reference to the dictionary as decimal digits and put a marker character in front of it (I've chosen the character with code 222, though it could be another one), which is itself included in danger. That way, during decompression, when this marker appears the code knows that the sequence following it has to be read directly as a dictionary reference, and not converted to binary and reassembled.
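
Since the decompression relies on a bare split(), it may be worth checking which byte values split() actually treats as separators. The following is a minimal sketch under the assumption that the compressed file is handled as a byte string (as a Python 2.7 str is); `separator_bytes` is a hypothetical helper, and bytearray is used only so the sketch should run on both Python 2.7 and Python 3.

```python
def separator_bytes():
    """Return the byte values that a bare .split() treats as separators."""
    found = []
    for value in range(256):
        sample = bytearray(b"a") + bytearray([value]) + bytearray(b"b")
        # If `value` is a separator, the sample splits into two pieces.
        if len(sample.split()) == 2:
            found.append(value)
    return found

danger = [0, 9, 10, 13, 32, 222, 255, 256]  # the list from the question
missing = [v for v in separator_bytes() if v not in danger]

print(separator_bytes())  # [9, 10, 11, 12, 13, 32]
print(missing)            # [11, 12]
```

Under that assumption, the vertical tab (11) and form feed (12) also act as separators but are not in danger, so an element containing one of them would be cut in two.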

Here is my decompression code:

output = open("compressed5.txt")
descomp = open("decompressed5.txt","w")

text = output.read()
compressed_data = text.split()
strings = dict([(x, chr(x)) for x in range(256)])

next_code = 257
previous_string = ""
binary = ""
a = 1
for element in compressed_data:
    for char in element:
        if ord(char) == 222:
            c = int(element[1:])
            break
        else:
            binary += bin(ord(char))[2:]
            if a == len(element):
                c = int(binary,2)
                a = 1
            else:
                a += 1
    binary = ""
    if not (strings.has_key(c)):
        strings[c] = previous_string + (previous_string[0])
    descomp.write(strings[c])
    if not(len(previous_string) == 0):
        strings[next_code] = previous_string + (strings[c][0])
        next_code +=1
    previous_string = strings[c]

output.close()
descomp.close()
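
One thing that may be worth checking is whether the bit strings survive the round trip. The following is a minimal sketch, not the code above: `chunk_values` and `rejoin` are hypothetical helpers that model only the 8-bit chunking in encode() and the bin(ord(char))[2:] concatenation in the decompression loop, leaving out the danger handling. It should run on Python 3 as well as 2.7.

```python
def chunk_values(order):
    """Split bin(order) into 8-bit groups from the left, as encode() does,
    and return the integer value of each group (the last may be shorter)."""
    bits = bin(order)[2:]
    return [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]

def rejoin(values):
    """Concatenate bin(v)[2:] for each value, as the decompression loop does.
    bin() never emits leading zeros, so any that a group had are dropped."""
    return int("".join(bin(v)[2:] for v in values), 2)

values = chunk_values(513)  # 513 is '1000000001': groups '10000000' and '01'
print(values)               # [128, 1]
print(rejoin(values))       # 257, not 513
```

Neither 128 nor 1 is in danger, so in this model the code 513 would be written as two ordinary characters and read back as 257. If that matches what the real code does, using a fixed number of bits per code (padding with leading zeros) would be one way to make the decoding unambiguous.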

I can't see what I'm missing here (I'm quite a novice in Python, actually), or whether I should add another problematic character to the danger list to avoid some kind of conflict with the "list" formatting. Or is there another way to write this list to the output file in compact form without losing its format?

Any kind of help is greatly appreciated!!

stonebird
  • What is the smallest input data which doesn't work? What is the output? – Peter Wood Nov 19 '15 at 05:42
  • 1. How can you be sure the compression is flawless? 2. Why not use binary file access (I am not a Python user, but most languages provide an API for binary access inherited from the OS API) instead of that weird extended-ASCII mess with a very likely messed-up signed format? 3. If this is your custom compression and you can modify it to your needs, have you considered something like GIF compression (no need to store the dictionary)? Also, I would avoid all ASCII codes below `32`, because some of them are printer escape codes that are also used by ASCII file access functions and can terminate reading prematurely, before the end of the file. – Spektre Nov 19 '15 at 08:13
  • Sorry for the late reply, I was able to fix it - it wasn't going well indeed due to the appearance of strange characters - I omitted some of those, especially those below 32 and the result was less error in the encoding, although less compression too, so finally it wasn't a good method to encode the compressed data. The compression algorithm - LZW works perfectly, the only thing is that one has to find a way to store the compressed data in one way or another, in order for the output file to be really "compressed" - I did that by changing the base 10 to a bigger one to represent the integers. – stonebird Nov 27 '15 at 22:26
  • Also, @Spektre, by GIF compression I suppose you mean the Huffman algorithm..(storing a dictionary is usually a Huffman derivative or so)? No, LZW is a different implementation and I was supposed to do that one. Thanks for the help anyway :) – stonebird Nov 27 '15 at 22:30
  • 1. In a GIF encoder/decoder the dictionary for LZW is created in a special way on the fly during encoding and also during decoding so it really is not stored inside image file (it has nothing to do with Huffman encoding) see [3MF project GIF](http://www.matthewflickinger.com/lab/whatsinagif/lzw_image_data.asp). The downside of that is you need to clear the dictionary from time to time so there is command for that inside LZW compressed stream (called clear code). 2. if you use binary file access then you got base 256 for the stored words and no limitations ... – Spektre Nov 28 '15 at 06:52
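
The binary-access suggestion in the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: every code is stored in a fixed 3 bytes (enough for values below 2**24, which is an arbitrary choice here), `write_codes` and `read_codes` are hypothetical helpers, and an in-memory io.BytesIO stands in for a file that would be opened with mode "wb" or "rb".

```python
import io
import struct

def write_codes(stream, codes):
    """Write each code as a 3-byte big-endian unsigned integer."""
    for code in codes:
        if not 0 <= code < 1 << 24:
            raise ValueError("code does not fit in 3 bytes")
        # Pack as 4 bytes and drop the leading (always zero) byte.
        stream.write(struct.pack(">I", code)[1:])

def read_codes(stream):
    """Read 3-byte big-endian codes until the stream is exhausted."""
    codes = []
    while True:
        chunk = stream.read(3)
        if len(chunk) < 3:
            break
        codes.append(struct.unpack(">I", b"\x00" + chunk)[0])
    return codes

buffer = io.BytesIO()
write_codes(buffer, [65, 513, 400000])
buffer.seek(0)
print(read_codes(buffer))  # [65, 513, 400000]
```

With a fixed width there is no separator and no escape character, so no byte value is special. The trade-off is that small codes also take 3 bytes, so this layout is not necessarily smaller than the variable-length scheme in the question.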
