I regularly have to extract images that have been copy-pasted into Excel files. Unfortunately, these files come in the despicable XLS format, so the simple unzip trick does not work. I decided to try writing a small Python script to do it myself.
(Extracting images is painful as I have to actually copy-paste into Paint to save them. There is no Save as... or Export button.)
If you look at the PNG reference (or already know it), you will see that a PNG file basically starts with a \x89PNG signature and ends with an IEND chunk.
So I tried the following code:
import sys

def info(s):
    print("[i] " + s)

def die(s):
    # Report a fatal error and stop with a non-zero exit status.
    print("[!] " + s)
    sys.exit(1)

info("Opening file: " + sys.argv[1])
with open(sys.argv[1], 'rb') as f:
    buf = f.read()
info("File read")

# Locate the 8-byte PNG signature.
offset_s = buf.find(b'\x89PNG\x0D\x0A\x1A\x0A')
if offset_s == -1:
    die("PNG start not found")
info("PNG start found at offset: {}".format(offset_s))

# Look for the IEND chunk type, starting from the signature.
offset_e = buf.find(b'IEND', offset_s)
if offset_e == -1:
    die("PNG end not found")
# Skip the 4-byte chunk type and the 4-byte CRC that follows it.
offset_e += 8
info("PNG end found at offset: {}".format(offset_e))

with open("out.png", "wb") as f:
    f.write(buf[offset_s:offset_e])
info("Written to out.png")
So it extracts the data. But the PNG data is corrupt (in the IDAT chunk), so it does not display properly. Here is the result of a pngcheck run:
File: out.png (221879 bytes)
chunk IHDR at offset 0x0000c, length 0
1366 x 768 image, 24-bit RGB, non-interlaced
chunk sRGB at offset 0x00025, length 0
rendering intent = perceptual
chunk pHYs at offset 0x00032, length 0: 3780x3780 pixels/meter (96 dpi)
chunk IDAT at offset 0x00047, length 0
zlib: deflated, 32K window, fast compression
CRC error in chunk IDAT (actual 632dd60d, should be 5985ed29)
Chunk name fffffffb 02 ffffff8a 5e doesn't conform to naming rules.
chunk ?? at offset 0x10008, length 0
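For reference, the check that pngcheck performs can be approximated in a few lines of Python. This is only a sketch, assuming the standard PNG chunk layout (4-byte big-endian length, 4-byte type, data, then a CRC-32 over type and data). The names `walk_chunks` and `make_chunk` are mine, not part of any library.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(ctype, body):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)

def walk_chunks(data):
    """Yield (chunk_type, length, crc_ok) for each chunk until IEND."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack_from(">I4s", data, pos)
        end = pos + 8 + length
        if end + 4 > len(data):
            # The declared length runs past the end of the data.
            yield ctype, length, False
            return
        (stored_crc,) = struct.unpack_from(">I", data, end)
        actual_crc = zlib.crc32(data[pos + 4:end]) & 0xFFFFFFFF
        yield ctype, length, actual_crc == stored_crc
        if ctype == b"IEND":
            return
        pos = end + 4

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        for ctype, length, ok in walk_chunks(f.read()):
            print(ctype, length, "ok" if ok else "CRC MISMATCH")
```

Run against out.png, I would expect it to flag the IDAT chunk and then lose sync, much like the pngcheck output above.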
Do you think (or better, know for a fact, as I could not find this information when I looked) that Excel stores PNG files with specific (or even proprietary) filters or compression algorithms?
Any idea on how I could get it to work?
Edit - research follow-up: I have been pursuing the analysis further. I took a bigger image, put it in a blank Excel file, and saved as XLS.
Then, I extracted it with my previous tool, and made a new one to identify 4-byte items added by Excel. Here goes the code:
import sys

def info(s):
    print("[i] " + s)

def die(s):
    print("[!] " + s)
    sys.exit(-1)

info("Opening original file: " + sys.argv[1])
i = 0  # current offset in the changed file
with open(sys.argv[1], 'rb') as original:
    info("Opening changed file: " + sys.argv[2])
    with open(sys.argv[2], 'rb') as changed:
        o_byte = original.read(1)
        c_byte = changed.read(1)
        while o_byte != b"":
            if c_byte == b"":
                die("Error reading from changed file.")
            # On a mismatch, treat the next 4 bytes of the changed file as an
            # inserted item, print them, and try again with the byte after them.
            while c_byte != o_byte:
                item = c_byte + changed.read(3)
                if len(item) < 4:
                    die("Changed file ended inside a 4-byte item.")
                info("{:08X} - Found diff: 0x{:02X} 0x{:02X} 0x{:02X} 0x{:02X}".format(i, *item))
                i += 4
                c_byte = changed.read(1)
            o_byte = original.read(1)
            c_byte = changed.read(1)
            i += 1
Running it against my original and XLS-extracted PNG files, I get the following output:
[i] Opening original file: test1.PNG
[i] Opening changed file: out.png
[i] 00001FAB - Found diff: 0xEB 0x00 0x20 0x20
[i] 00003FCF - Found diff: 0x3C 0x00 0x20 0x20
[i] 00005FF3 - Found diff: 0x3C 0x00 0x20 0x20
[i] 00008017 - Found diff: 0x3C 0x00 0x20 0x20
[i] 000090BE - Found diff: 0x81 0x00 0x00 0x00
[i] 000090C2 - Found diff: 0x82 0x00 0x00 0x00
[i] 000090C6 - Found diff: 0x83 0x00 0x00 0x00
[i] 000090CA - Found diff: 0x84 0x00 0x00 0x00
[i] 000090CE - Found diff: 0x85 0x00 0x00 0x00
[i] 000090D2 - Found diff: 0x86 0x00 0x00 0x00
[i] 000090D6 - Found diff: 0x87 0x00 0x00 0x00
[i] 000090DA - Found diff: 0x88 0x00 0x00 0x00
[i] 000090DE - Found diff: 0x89 0x00 0x00 0x00
[i] 000090E2 - Found diff: 0x8A 0x00 0x00 0x00
[i] 000090E6 - Found diff: 0x8B 0x00 0x00 0x00
[i] 000090EA - Found diff: 0x8C 0x00 0x00 0x00
[i] 000090EE - Found diff: 0x8D 0x00 0x00 0x00
[i] 000090F2 - Found diff: 0x8E 0x00 0x00 0x00
[i] 000090F6 - Found diff: 0x8F 0x00 0x00 0x00
[i] 000090FA - Found diff: 0x90 0x00 0x00 0x00
[i] 000090FE - Found diff: 0x91 0x00 0x00 0x00
[i] 00009102 - Found diff: 0x92 0x00 0x00 0x00
[i] 00009106 - Found diff: 0x93 0x00 0x00 0x00
[i] 0000910A - Found diff: 0x94 0x00 0x00 0x00
[i] 0000910E - Found diff: 0x95 0x00 0x00 0x00
[i] 00009112 - Found diff: 0x96 0x00 0x00 0x00
[i] 00009116 - Found diff: 0x97 0x00 0x00 0x00
[i] 0000911A - Found diff: 0x98 0x00 0x00 0x00
[i] 0000911E - Found diff: 0x99 0x00 0x00 0x00
[i] 00009122 - Found diff: 0x9A 0x00 0x00 0x00
[i] 00009126 - Found diff: 0x9B 0x00 0x00 0x00
[i] 0000912A - Found diff: 0x9C 0x00 0x00 0x00
[i] 0000912E - Found diff: 0x9D 0x00 0x00 0x00
[i] 00009132 - Found diff: 0x9E 0x00 0x00 0x00
[i] 00009136 - Found diff: 0x9F 0x00 0x00 0x00
[i] 0000913A - Found diff: 0xA0 0x00 0x00 0x00
[i] 0000913E - Found diff: 0xA1 0x00 0x00 0x00
[i] 00009142 - Found diff: 0xA2 0x00 0x00 0x00
[i] 00009146 - Found diff: 0xA3 0x00 0x00 0x00
[i] 0000914A - Found diff: 0xA4 0x00 0x00 0x00
[i] 0000914E - Found diff: 0xA5 0x00 0x00 0x00
[i] 00009152 - Found diff: 0xA6 0x00 0x00 0x00
[i] 00009156 - Found diff: 0xA7 0x00 0x00 0x00
[i] 0000915A - Found diff: 0xA8 0x00 0x00 0x00
[i] 0000915E - Found diff: 0xA9 0x00 0x00 0x00
[i] 00009162 - Found diff: 0xAA 0x00 0x00 0x00
[i] 00009166 - Found diff: 0xAB 0x00 0x00 0x00
[i] 0000916A - Found diff: 0xAC 0x00 0x00 0x00
[i] 0000916E - Found diff: 0xAD 0x00 0x00 0x00
[i] 00009172 - Found diff: 0xAE 0x00 0x00 0x00
[i] 00009176 - Found diff: 0xAF 0x00 0x00 0x00
[i] 0000917A - Found diff: 0xB0 0x00 0x00 0x00
[i] 0000917E - Found diff: 0xB1 0x00 0x00 0x00
[i] 00009182 - Found diff: 0xB2 0x00 0x00 0x00
[i] 00009186 - Found diff: 0xB3 0x00 0x00 0x00
[i] 0000918A - Found diff: 0xB4 0x00 0x00 0x00
[i] 0000918E - Found diff: 0xB5 0x00 0x00 0x00
[i] 00009192 - Found diff: 0xB6 0x00 0x00 0x00
[i] 00009196 - Found diff: 0xB7 0x00 0x00 0x00
[i] 0000919A - Found diff: 0xB8 0x00 0x00 0x00
[i] 0000919E - Found diff: 0xB9 0x00 0x00 0x00
[i] 000091A2 - Found diff: 0xBA 0x00 0x00 0x00
[i] 000091A6 - Found diff: 0xBB 0x00 0x00 0x00
[i] 000091AA - Found diff: 0xBC 0x00 0x00 0x00
[i] 000091AE - Found diff: 0xBD 0x00 0x00 0x00
[i] 000091B2 - Found diff: 0xBE 0x00 0x00 0x00
[i] 000091B6 - Found diff: 0xBF 0x00 0x00 0x00
[i] 000091BA - Found diff: 0xC0 0x00 0x00 0x00
[i] 000091BE - Found diff: 0xC1 0x00 0x00 0x00
[i] 000091C2 - Found diff: 0xC2 0x00 0x00 0x00
[i] 000091C6 - Found diff: 0xC3 0x00 0x00 0x00
[i] 000091CA - Found diff: 0xC4 0x00 0x00 0x00
[i] 000091CE - Found diff: 0xC5 0x00 0x00 0x00
[i] 000091D2 - Found diff: 0xC6 0x00 0x00 0x00
[i] 000091D6 - Found diff: 0xC7 0x00 0x00 0x00
[i] 000091DA - Found diff: 0xC8 0x00 0x00 0x00
[i] 000091DE - Found diff: 0xC9 0x00 0x00 0x00
[i] 000091E2 - Found diff: 0xCA 0x00 0x00 0x00
[i] 000091E6 - Found diff: 0xCB 0x00 0x00 0x00
[i] 000091EA - Found diff: 0xCC 0x00 0x00 0x00
[i] 000091EE - Found diff: 0xCD 0x00 0x00 0x00
[i] 000091F2 - Found diff: 0xCE 0x00 0x00 0x00
[i] 000091F6 - Found diff: 0xCF 0x00 0x00 0x00
[i] 000091FA - Found diff: 0xD0 0x00 0x00 0x00
[i] 000091FE - Found diff: 0xD1 0x00 0x00 0x00
[i] 00009202 - Found diff: 0xD2 0x00 0x00 0x00
[i] 00009206 - Found diff: 0xD3 0x00 0x00 0x00
[i] 0000920A - Found diff: 0xD4 0x00 0x00 0x00
[i] 0000920E - Found diff: 0xD5 0x00 0x00 0x00
[i] 00009212 - Found diff: 0xD6 0x00 0x00 0x00
[i] 00009216 - Found diff: 0xD7 0x00 0x00 0x00
[i] 0000921A - Found diff: 0xD8 0x00 0x00 0x00
[i] 0000921E - Found diff: 0xD9 0x00 0x00 0x00
[i] 00009222 - Found diff: 0xDA 0x00 0x00 0x00
[i] 00009226 - Found diff: 0xDB 0x00 0x00 0x00
[i] 0000922A - Found diff: 0xDC 0x00 0x00 0x00
[i] 0000922E - Found diff: 0xDD 0x00 0x00 0x00
[i] 00009232 - Found diff: 0xDE 0x00 0x00 0x00
[i] 00009236 - Found diff: 0xDF 0x00 0x00 0x00
[i] 0000923A - Found diff: 0xE0 0x00 0x00 0x00
[i] 0000923E - Found diff: 0xE1 0x00 0x00 0x00
[i] 00009242 - Found diff: 0xE2 0x00 0x00 0x00
[i] 00009246 - Found diff: 0xE3 0x00 0x00 0x00
[i] 0000924A - Found diff: 0xE4 0x00 0x00 0x00
[i] 0000924E - Found diff: 0xE5 0x00 0x00 0x00
[i] 00009252 - Found diff: 0xE6 0x00 0x00 0x00
[i] 00009256 - Found diff: 0xE7 0x00 0x00 0x00
[i] 0000925A - Found diff: 0xE8 0x00 0x00 0x00
[i] 0000925E - Found diff: 0xE9 0x00 0x00 0x00
[i] 00009262 - Found diff: 0xEA 0x00 0x00 0x00
[i] 00009266 - Found diff: 0xEB 0x00 0x00 0x00
[i] 0000926A - Found diff: 0xEC 0x00 0x00 0x00
[i] 0000926E - Found diff: 0xED 0x00 0x00 0x00
[i] 00009272 - Found diff: 0xEE 0x00 0x00 0x00
[i] 00009276 - Found diff: 0xEF 0x00 0x00 0x00
[i] 0000927A - Found diff: 0xF0 0x00 0x00 0x00
[i] 0000927E - Found diff: 0xF1 0x00 0x00 0x00
[i] 00009282 - Found diff: 0xF2 0x00 0x00 0x00
[i] 00009286 - Found diff: 0xF3 0x00 0x00 0x00
[i] 0000928A - Found diff: 0xFE 0xFF 0xFF 0xFF
[i] 0000928E - Found diff: 0xFE 0xFF 0xFF 0xFF
[i] 00009292 - Found diff: 0xF6 0x00 0x00 0x00
[i] 00009296 - Found diff: 0xFE 0xFF 0xFF 0xFF
[i] 0000929A - Found diff: 0xFE 0xFF 0xFF 0xFF
[i] 0000929E - Found diff: 0xFF 0xFF 0xFF 0xFF
[i] 000092A2 - Found diff: 0xFF 0xFF 0xFF 0xFF
[i] 000092A6 - Found diff: 0xFF 0xFF 0xFF 0xFF
[i] 000092AA - Found diff: 0xFF 0xFF 0xFF 0xFF
[i] 000092AE - Found diff: 0xFF 0xFF 0xFF 0xFF
[i] 000092B2 - Found diff: 0xFF 0xFF 0xFF 0xFF
[i] 000092B6 - Found diff: 0xFF 0xFF 0xFF 0xFF
[i] 000092BA - Found diff: 0xFF 0xFF 0xFF 0xFF
[i] 0000A23B - Found diff: 0x3C 0x00 0x20 0x20
[i] 0000C25F - Found diff: 0x3C 0x00 0x20 0x20
[i] 0000E283 - Found diff: 0x3C 0x00 0x20 0x20
[i] 000102A7 - Found diff: 0x3C 0x00 0x20 0x20
[i] 000122CB - Found diff: 0x3C 0x00 0x20 0x20
[i] 000142EF - Found diff: 0x3C 0x00 0x20 0x20
[i] 00016313 - Found diff: 0x3C 0x00 0x20 0x20
[i] 00018337 - Found diff: 0x3C 0x00 0x20 0x20
[i] 0001A35B - Found diff: 0x3C 0x00 0x0D 0x0B
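A quick arithmetic check on the offsets above, as a sketch: the entries starting with 0xEB or 0x3C look evenly spaced, and the snippet below measures the gaps. The offsets are copied from the output; reading the 0x20 0x20 bytes as a little-endian length is my assumption, not something I have confirmed.

```python
# Offsets of the entries whose first byte is 0xEB or 0x3C, copied from the output.
offsets = [0x1FAB, 0x3FCF, 0x5FF3, 0x8017, 0xA23B, 0xC25F, 0xE283,
           0x102A7, 0x122CB, 0x142EF, 0x16313, 0x18337, 0x1A35B]

# Distance between consecutive entries.
gaps = [b - a for a, b in zip(offsets, offsets[1:])]

# Size of the "counting" block: 4-byte entries from 0x90BE to 0x92BA inclusive.
counting_block = (0x92BA - 0x90BE) + 4

for gap in gaps:
    print("0x{:04X} ({})".format(gap, gap))
print("counting block: {} bytes".format(counting_block))
```

Every gap is 0x2024 (8228) bytes except the one spanning the counting block, which is exactly 512 bytes longer, and the counting block itself is 512 bytes. 0x2024 is 4 more than 0x2020, which would fit a 4-byte header followed by 0x2020 bytes of data.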
Who the hell is this 0x3C guy? And why does Excel start counting at some point (0x81, 0x82, 0x83, ...)?
Edit - additional pointers: It seems 0x003C is the identifier of a CONTINUE record in the Excel file format, as documented in https://www.openoffice.org/sc/excelfileformat.pdf
And the counting might be the compound document SSAT table, but I am not sure.
Still no idea about the 0xEB, though.
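If the CONTINUE reading is right, each 4-byte item would be a record header: a 2-byte identifier followed by a 2-byte payload size, both little-endian. Below is a minimal sketch of decoding such headers and gluing CONTINUE payloads back onto the record they follow. The function names are hypothetical, and it assumes the input is the workbook stream as one contiguous byte string. It does not deal with the compound document sector layout, so the counting block would still have to be removed by other means.

```python
import struct

CONTINUE = 0x003C  # identifier of the CONTINUE record, as noted above

def decode_header(four_bytes):
    """Read 4 bytes as a little-endian (record id, payload size) pair."""
    return struct.unpack("<HH", four_bytes)

def iter_records(stream):
    """Yield (record_id, payload) for each record in a byte string."""
    pos = 0
    while pos + 4 <= len(stream):
        rec_id, size = decode_header(stream[pos:pos + 4])
        yield rec_id, stream[pos + 4:pos + 4 + size]
        pos += 4 + size

def join_continue_records(stream):
    """Append each CONTINUE payload to the record that precedes it."""
    merged = []
    for rec_id, payload in iter_records(stream):
        if rec_id == CONTINUE and merged:
            prev_id, prev_payload = merged[-1]
            merged[-1] = (prev_id, prev_payload + payload)
        else:
            merged.append((rec_id, payload))
    return merged

# The bytes from one of the diff lines: id 0x003C, size 0x2020.
print(["0x{:04X}".format(v) for v in decode_header(bytes([0x3C, 0x00, 0x20, 0x20]))])
```

With the payloads joined this way, searching for the PNG signature and IEND inside a single merged payload should no longer trip over the inserted headers.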