
I would like this code to behave the same when run with Python 2 or Python 3:

from zipfile import ZipFile, ZipInfo

with ZipFile("out.zip", 'w') as zf:
    content = "content"
    info = ZipInfo()
    info.filename = "file.txt"
    info.flag_bits = 0x800
    info.file_size = len(content)
    zf.writestr(info, content)

However, under Python 2 out.zip starts:

50 4b 03 04 14 00 00 08

Under Python 3, it starts:

50 4b 03 04 14 00 00 00

The differing part is flag_bits (the last two bytes shown, little-endian): 0x800 under Python 2, 0x00 under Python 3. That is bit 11, the language encoding flag. Bit 11 seems to get set only if filename.encode("ascii") throws.
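For anyone reading the dumps: the first 8 bytes of a local file header are the 4-byte signature, a 2-byte "version needed" field, and the 2-byte general purpose flags, all little-endian. Here is a minimal sketch of decoding them (the helper name `local_header_flags` is made up for illustration):

```python
import struct

UTF8_FLAG = 0x0800  # general purpose bit 11: language encoding (UTF-8 names)


def local_header_flags(first_bytes):
    """Return the general purpose flags from the start of a local file header."""
    signature, _version, flags = struct.unpack("<4sHH", bytes(first_bytes[:8]))
    if signature != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    return flags


# The two dumps from the question.
py2_dump = bytearray.fromhex("50 4b 03 04 14 00 00 08")
py3_dump = bytearray.fromhex("50 4b 03 04 14 00 00 00")

for label, dump in (("Python 2", py2_dump), ("Python 3", py3_dump)):
    flags = local_header_flags(dump)
    print("%s: flags=0x%04x, bit 11 set: %s" % (label, flags, bool(flags & UTF8_FLAG)))
```

The bytes `00 08` read as a little-endian 16-bit value give 0x0800, which is why the flag shows up as `08` in the eighth byte.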

I tried to force this bit on by setting the flag after creating the ZipInfo object, but it gets reset back to 0x00 in _open_to_write().

I wonder if anyone here has a good solution. Ideally I'd like both outputs to have the flag set, because that mirrors what the jar utility does.

EDIT: Updated to add the info.flag_bits = 0x800 line to spell out what I'm trying to achieve. I've reproduced this on Windows 10 (ActivePython 3.6.0.3600 vs ActivePython 2.7.14.2717) and on Linux (Python 3.6.6 vs Python 2.7.11). In case it matters, I am running this exactly as in my example, with no shebang line, invoking the interpreter directly:

pythonX test.py
Dharman
Keeely
  • Perhaps I am mistaken but I seem to get the output `50 4b 03 04 14 00 00 00` for both Python 2 and Python 3 on my Debian machine under Python 3.5.3 and Python 2.7.13 – Algorithmic Canary Nov 12 '18 at 01:06
  • Likewise, it's the same output on Windows with Python 2 and 3 for me (as what you show for Python 3 in your question). Sounds like something OS-dependent. What are you running? – martineau Nov 12 '18 at 02:50
  • @martineau that's still not what I want, I want the bit set for both, I've changed my question as it wasn't so clear before. Thanks for testing this, it's useful feedback, perhaps you can post your versions. – Keeely Nov 12 '18 at 09:36
  • Keeely: Got it. Your code creates a file. You want to make sure a bit is set at a certain offset in that file. It seems that, if nothing else, you could modify the file manually after it's created using binary file I/O. – martineau Nov 12 '18 at 09:42
  • @martineau, indeed that is my last resort, but it's a pretty horrible solution. – Keeely Nov 12 '18 at 09:47
  • I've reproduced this with Python 3.8.0a0 (heads/master:0d12672b30, Nov 13 2018, 09:34:21) (latest git), and raised a ticket for it: https://bugs.python.org/issue35218. – Keeely Nov 13 '18 at 09:46
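For reference, the "patch the file afterwards" idea from the comments could look roughly like the sketch below. It walks the central directory and ORs bit 11 into both the central directory entry and the matching local file header. The function names are made up, and it assumes a plain archive (no zip64 records, no archive comment) whose names are ASCII or already UTF-8.

```python
import os
import struct
import tempfile
from zipfile import ZipFile, ZipInfo

UTF8_FLAG = 0x0800  # general purpose bit 11


def _or_flag(f, offset):
    """OR UTF8_FLAG into the little-endian 16-bit field at the given offset."""
    f.seek(offset)
    (flags,) = struct.unpack("<H", f.read(2))
    f.seek(offset)
    f.write(struct.pack("<H", flags | UTF8_FLAG))


def force_utf8_flag(path):
    """Set bit 11 in every local header and central directory entry.

    Assumption: no zip64 records and no archive comment, so the
    end-of-central-directory record is exactly the last 22 bytes.
    """
    with open(path, "r+b") as f:
        f.seek(-22, os.SEEK_END)
        eocd = f.read(22)
        if eocd[:4] != b"PK\x05\x06":
            raise ValueError("end of central directory record not found")
        # Total entry count, central directory size, central directory offset.
        total, _cd_size, pos = struct.unpack("<HII", eocd[10:20])
        for _ in range(total):
            f.seek(pos)
            entry = f.read(46)  # fixed-size part of a central directory entry
            if entry[:4] != b"PK\x01\x02":
                raise ValueError("bad central directory entry")
            name_len, extra_len, comment_len = struct.unpack("<HHH", entry[28:34])
            (local_offset,) = struct.unpack("<I", entry[42:46])
            _or_flag(f, pos + 8)           # flags in the central directory entry
            _or_flag(f, local_offset + 6)  # flags in the local file header
            pos += 46 + name_len + extra_len + comment_len


# Small demonstration on a temporary archive.
fd, path = tempfile.mkstemp(suffix=".zip")
os.close(fd)
with ZipFile(path, "w") as zf:
    zf.writestr(ZipInfo("file.txt"), b"content")
force_utf8_flag(path)
with open(path, "rb") as f:
    (local_flags,) = struct.unpack("<H", f.read(8)[6:8])
with ZipFile(path) as zf:
    central_flags = zf.getinfo("file.txt").flag_bits
os.remove(path)
```

It is still a post-processing hack, but it does not depend on zipfile internals and should behave the same under Python 2 and Python 3.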

2 Answers

1

Edit: Here's code that works for me with Python 2.7 but not with 3.6 (a bit of a mystery, since it seemed to work earlier this evening):

$ cat zipf.py
from __future__ import print_function

from zipfile import ZipFile, ZipInfo

with ZipFile("out.zip", 'w') as zf:
    content = "content"
    info = ZipInfo()
    info.filename = "file.txt"
    info.flag_bits = 0x800
    # don't set info.file_size here: zf.writestr() does that
    zf.writestr(info, content)

with open('out.zip', 'rb') as stream:
    byteseq = stream.read(8)
    for i in byteseq:
        if isinstance(i, str): i = ord(i)
        print('{:02x}'.format(i), end=' ')
    print()

Run as:

$ python2.7 zipf.py
50 4b 03 04 14 00 00 08 

but:

$ python3.6 zipf.py
50 4b 03 04 14 00 00 00 

It's certainly possible to make it work by setting info.flag_bits only after the entry has been opened for writing. However, then you must avoid writestr, and this needs Python 3.6 or later because ZipFile.open() does not accept mode 'w' in Python 2.7 (it also seems rather abusive):

from __future__ import print_function

from zipfile import ZipFile, ZipInfo

with ZipFile("out.zip", 'w') as zf:
    info = ZipInfo()
    info.filename = "file.txt"
    content = "content"
    if not isinstance(content, bytes):
        content = content.encode('utf8')
    info.file_size = len(content)
    with zf.open(info, 'w') as stream:
        info.flag_bits = 0x800
        stream.write(content)

with open('out.zip', 'rb') as stream:
    byteseq = stream.read(8)
    for i in byteseq:
        if isinstance(i, str): i = ord(i)
        print('{:02x}'.format(i), end=' ')
    print()

Python 3.6 resets info.flag_bits in the internal open that writestr performs. That is probably just incorrect, although I'm not really sure.

Original answer below

I cannot reproduce this, but you're right that bit 11 in the flag bits is set if the file name is Unicode and encoding as ASCII fails:

def _encodeFilenameFlags(self):
    if isinstance(self.filename, unicode):
        try:
            return self.filename.encode('ascii'), self.flag_bits
        except UnicodeEncodeError:
            return self.filename.encode('utf-8'), self.flag_bits | 0x800
    else:
        return self.filename, self.flag_bits

(Python 2.7 zipfile.py source) or:

def _encodeFilenameFlags(self):
    try:
        return self.filename.encode('ascii'), self.flag_bits
    except UnicodeEncodeError:
        return self.filename.encode('utf-8'), self.flag_bits | 0x800

(Python 3.6 zipfile.py source).

To get the bit set you need a filename that cannot be encoded directly in ASCII, e.g.:

info.filename = u"sch\N{latin small letter o with diaeresis}n" # "file.txt"

(this notation works with both Python 2.7 and 3.6).
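A quick way to see the effect of the file name is to write one entry into an in-memory archive and read the flags back from the local header. This is only a sketch; `first_header_flags` is a name I made up for the example.

```python
import io
import struct
from zipfile import ZipFile, ZipInfo

UTF8_FLAG = 0x0800  # general purpose bit 11


def first_header_flags(filename):
    """Write one entry to an in-memory archive and return its local header flags."""
    buf = io.BytesIO()
    with ZipFile(buf, "w") as zf:
        zf.writestr(ZipInfo(filename), b"content")
    # Bytes 6-7 of the archive are the flags of the first local file header.
    (flags,) = struct.unpack("<H", buf.getvalue()[6:8])
    return flags


ascii_flags = first_header_flags(u"file.txt")
utf8_flags = first_header_flags(u"sch\N{LATIN SMALL LETTER O WITH DIAERESIS}n")
print("ascii name: 0x%04x, non-ascii name: 0x%04x" % (ascii_flags, utf8_flags))
```

Only the name that cannot be encoded as ASCII ends up with bit 11 set.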

I tried to force this bit on by setting the flag after creating the ZipInfo object, but it gets reset back to 0x00 in _open_to_write().

If I add:

info.filename = "file.txt"
info.flag_bits |= 0x0800

(just after setting the filename to u"schön") and run this under Python 2.7 or 3.6, I get the bit set in the header (of course the file name in the zip directory changes back to file.txt).

torek
  • Can you post your full code if you got the bit set for filename==file.txt with Python3? – Keeely Nov 12 '18 at 09:34
  • @Keeely: I deleted it after posting, but I started by copying your sample from before the last edit. It essentially matched your current sample. I ran it on FreeBSD but the behavior should be the same as long as the `zipfile` library code is the same... – torek Nov 12 '18 at 09:55
  • Thanks, but can I have the precise major and minor versions of all Pythons used? I have up-voted the post, but at the moment it doesn't exactly give a solution (bit set for both Python versions), so I cannot accept it. – Keeely Nov 12 '18 at 10:33
  • One is `sys.version_info(major=2, minor=7, micro=15, releaselevel='final', serial=0)`, the other is `sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)`. Let me try re-creating the test, too. – torek Nov 12 '18 at 11:26
0

I am using something like this for the time being:

from zipfile import ZipFile, ZipInfo
import struct

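orig_function = ZipInfo.FileHeader

def new_function(self, zip64=None):
    # Build the normal local file header, then patch its flags field.
    header = orig_function(self, zip64)
    fmt = "B"*len(header)
    blist = list(struct.unpack(fmt, header))
    # Byte 7 is the high byte of the little-endian flags field,
    # so 0x08 here corresponds to 0x0800 (bit 11).
    blist[7] |= 0x8
    return struct.pack(fmt, *blist)

# Monkey-patch ZipInfo so every entry uses the patched header.
ZipInfo.FileHeader = new_function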

with ZipFile("out.zip", 'w') as zf:
    content = "content"
    info = ZipInfo()
    info.filename = "file.txt"
    info.file_size = len(content)
    zf.writestr(info, content)

Hopefully it won't break too soon; FileHeader() seems like something that won't be changing in the future.
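A possible variation, sketched under some assumptions: subclass ZipInfo and override the private _encodeFilenameFlags() helper quoted in the other answer, instead of patching FileHeader() globally. As far as I can tell, both the local file header and the central directory entry take their flags from that helper, whereas patching FileHeader() changes only the local header. It is a private method, so it may change without notice, and the class name `Utf8FlagZipInfo` is made up for this example.

```python
import io
import struct
from zipfile import ZipFile, ZipInfo

UTF8_FLAG = 0x0800  # general purpose bit 11


class Utf8FlagZipInfo(ZipInfo):
    """Hypothetical ZipInfo subclass that always reports bit 11 as set."""

    def _encodeFilenameFlags(self):
        # Private zipfile helper; assumption: it keeps this name and
        # returns a (encoded_name, flag_bits) tuple.
        name = self.filename
        if not isinstance(name, bytes):
            name = name.encode("utf-8")
        return name, self.flag_bits | UTF8_FLAG


buf = io.BytesIO()
with ZipFile(buf, "w") as zf:
    zf.writestr(Utf8FlagZipInfo("file.txt"), b"content")

# Flags as written in the local file header (bytes 6-7 of the archive).
(local_flags,) = struct.unpack("<H", buf.getvalue()[6:8])

# Flags as recorded in the central directory, read back through zipfile.
with ZipFile(io.BytesIO(buf.getvalue())) as zf:
    central_flags = zf.getinfo("file.txt").flag_bits

print("local: 0x%04x, central: 0x%04x" % (local_flags, central_flags))
```

Because the change lives in a subclass, ordinary ZipInfo objects elsewhere in the program are left alone.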

Keeely