I am new to Python, and we have been trying to use the LZW code from GitHub in our program: https://github.com/joeatwork/python-lzw/blob/master/lzw/init.py
It works well for smaller blobs, but when the blob size increases it does not decompress the whole blob. I have been reading the documentation, but I am unable to understand the part quoted below, which might be the reason why the full blob is not getting decompressed.
I have also attached a snippet of the Python code I am using.
Our control codes are

- CLEAR_CODE (codepoint 256). When this code is encountered, we flush
  the codebook and start over.
- END_OF_INFO_CODE (codepoint 257). This code is reserved for
  encoder/decoders over the integer codepoint stream (like the
  mechanical bit that unpacks bits into codepoints)

When dealing with bytes, codes are emitted as variable
length bit strings packed into the stream of bytes.

codepoints are written with varying length
- initially 9 bits
- at 512 entries 10 bits
- at 1025 entries at 11 bits
- at 2048 entries 12 bits
- with max of 4095 entries in a table (including Clear and EOI)

code points are stored with their MSB in the most significant bit
available in the output character.
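
To check my understanding of the last point in the documentation, here is a minimal sketch of how I read "MSB first" packing. The function name `unpack_msb_first` and the fixed `width` parameter are my own illustration, not part of python-lzw; the real decoder also widens the codes as the table grows, which this sketch deliberately does not do.

```python
def unpack_msb_first(data, width=9):
    """Yield integer codes of `width` bits each, reading bits MSB-first.

    Illustrative only: assumes a fixed code width, whereas a real LZW
    decoder increases the width as its table fills up.
    """
    acc = 0    # accumulator holding bits not yet consumed
    nbits = 0  # number of valid bits currently in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= width:
            nbits -= width
            yield (acc >> nbits) & ((1 << width) - 1)
        acc &= (1 << nbits) - 1  # keep only the leftover low bits


# Two 9-bit codes, 256 (CLEAR_CODE) and 257 (END_OF_INFO_CODE),
# packed MSB-first and padded with zero bits to three bytes.
print(list(unpack_msb_first(bytes([0x80, 0x40, 0x40]))))  # [256, 257]
```

If this reading is right, any trailing bits that do not fill a whole code are simply padding and get dropped.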
My code snippet:
import html2text
import lzw
import pandas as pd

def decompress_without_eoi(buf):
    # Decompress LZW into bytes, ignoring the End of Information code
    def gen():
        try:
            for byte in lzw.decompress(buf):
                yield byte
        except ValueError as exc:
            #print(repr(exc))
            if 'End of information code' in repr(exc):
                #print('Ignoring EOI error..\n')
                pass
            else:
                raise
            return
    try:
        #print('Trying a join..\n')
        deblob = b''.join(gen())
    except Exception as exc2:
        #print(repr(exc2))
        #print('Trying byte by byte..')
        deblob = []
        try:
            for byte in gen():
                deblob.append(byte)
        except Exception as exc3:
            #print(repr(exc3))
            # Return whatever was decoded before the error
            return b''.join(deblob)
    return deblob
# Current function to deblob
def deblob3(row):
    if pd.notnull(row[0]):
        blob = row[0]
        h = html2text.HTML2Text()
        h.ignore_links = True
        h.ignore_images = True
        if not isinstance(blob, bytes):
            blobbytes = blob.read()[:-10]
        else:
            blobbytes = blob[:-10]
        if row[1] == 361:
            # If compressed, return up to EOI-257 code, which is last non-null code before tag
            # print(row[0])
            return h.handle(striprtf(decompress_without_eoi(blobbytes)))
        elif row[1] == 360:
            # If uncompressed, return up to tag
            return h.handle(striprtf(blobbytes))
The deblob3 function is called as follows:
nf['IS_BLOB'] = nf[['IS_BLOB','COMPRESSION']].apply(deblob3,axis=1)