Question:
What is the difference between open(<name>, "w", encoding=<encoding>) and open(<name>, "wb") + str.encode(<encoding>)? They seem to (sometimes) produce different outputs.

Context:
While using PyFPDF (version 1.7.2), I subclassed the FPDF class, and, among other things, added my own output method (taking pathlib.Path objects). While looking at the source of the original FPDF.output() method, I noticed almost all of it is argument parsing - the only relevant bits are

#Finish document if necessary
if(self.state < 3):
    self.close()
[...]
f=open(name,'wb')
if(not f):
    self.error('Unable to create output file: '+name)
if PY3K:
    # manage binary data as latin1 until PEP461 or similar is implemented
    f.write(self.buffer.encode("latin1"))
else:
    f.write(self.buffer)
f.close()

Seeing that, my own implementation looked like this:

def write_file(self, file: Path) -> None:
    if self.state < 3:
        # See FPDF.output()
        self.close()
    file.write_text(self.buffer, "latin1", "strict")

This seemed to work - a .pdf file was created at the specified path, and Chrome opened it. But it was completely blank, even though I had added images and text. After hours of experimenting, I finally found a version that worked (produced a non-empty PDF file):

def write_file(self, file: Path) -> None:
    if self.state < 3:
        # See FPDF.output()
        self.close()
    # using .write_text(self.buffer, "latin1", "strict") DOES NOT WORK AND I DON'T KNOW WHY
    file.write_bytes(self.buffer.encode("latin1", "strict"))

Looking at the pathlib.Path source, Path.write_text() uses io.open. As all of this is Python 3.8, io.open and the built-in open() are the same.

Note: FPDF.buffer is of type str, but holds binary data (a PDF file), probably because the library was originally written for Python 2.
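As a minimal sketch of why a str can carry binary data at all: latin1 maps code points 0 to 255 one-to-one onto byte values, so encoding with it is lossless. The names `all_bytes`, `as_text` and `round_tripped` are illustrative only and do not come from PyFPDF.

```python
# Every possible byte value, as a stand-in for arbitrary binary data.
all_bytes = bytes(range(256))

# latin1 decodes each byte to the code point with the same number...
as_text = all_bytes.decode("latin1")

# ...and encodes each code point 0-255 back to the same byte.
round_tripped = as_text.encode("latin1", "strict")

print(round_tripped == all_bytes)
print([ord(c) for c in as_text[:4]])
```

So the encoding step itself should not change the data, whichever way the file is written.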

Xtrem532

2 Answers

Both should be the same (with minor differences).

I prefer the open approach, because it is explicit and shorter. On the other hand, if you want to handle encoding errors yourself (e.g. to give the user a much better error message), you should use decode/encode directly (maybe after s.split('\n'), so you can keep track of line numbers).

Note: if you use the first method (open with an encoding), you should use just r or w, without the b. Judging by your question's title you got that right, but check whether your example keeps the b; that is probably why it calls .encode(). On the other hand, the code looks old, and I think the .encode() call was used simply because it felt more natural in a Python 2 mindset.

Note: I would also replace strict with backslashreplace for debugging. You may also want to print the first few characters of self.buffer (maybe just their ord values) for both methods, to see whether there are substantial differences before the file is written.
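One possible way to do that kind of comparison is to look for the first byte where the expected data and the file contents diverge. This is only a sketch: `first_difference` is a hypothetical helper, and `on_disk` is hard-coded sample data rather than bytes read back from a real PDF.

```python
def first_difference(a: bytes, b: bytes) -> int:
    """Return the index of the first differing byte, or -1 if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the common prefix: equal, or one is a prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))


# What we intended to write (the str buffer encoded as latin1).
expected = "ab\ncd".encode("latin1")
# Assumed sample of what ended up in the file.
on_disk = b"ab\r\ncd"

idx = first_difference(expected, on_disk)
print(idx, expected[idx:idx + 4], on_disk[idx:idx + 4])
```

In practice `on_disk` would come from reading the written file back in binary mode ("rb").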

I would add a file.flush() to both functions. This is one of the differences: the buffering differs, so I would make sure the file is closed. Python will do it eventually, but when debugging it is important to see the content of the file as soon as possible (and also after an exception). The garbage collector does not guarantee any of this. Maybe you are reading a file that has not been flushed yet.

Giacomo Catenazzi
  • I never interact directly with any `open()` methods - I use `pathlib.Path.write_text()` or `pathlib.Path.write_bytes()`, both are self-closing and don't expose the mode parameter (or a file object to flush). And I use "strict", because if the library somehow uses unrecognized chars while creating a pdf file, the output of "backslashreplace" would almost certainly be corrupt - so just raise an error instead of writing corrupt files – Xtrem532 Oct 13 '20 at 12:28
  • `backslashreplace` is only for debugging, `strict` is good for production. Note: your question is about `open`, not `pathlib.Path.write_text()`. Maybe you should add a working example that shows the problem. – Giacomo Catenazzi Oct 13 '20 at 12:36
  • Sadly, the data used to construct the pdf is sensitive, and currently I don't have the time to cook up a working example. As said in the context section, the `pathlib.Path` methods are just wrappers around `io.open()`, but avoiding the common pitfalls (flushing, closing, etc). As the data __is not text__, there is no sense in using anything but "strict". I will add a note to the question. – Xtrem532 Oct 13 '20 at 12:51
  • If the data is not text, it makes no sense to use `encoding` in `open` – Giacomo Catenazzi Oct 13 '20 at 12:59
  • But then why does it work when I first `encode()` the `str` and then write it with `open()` in binary mode? Shouldn't it work equally well if I pass the unmodified `str` to `open()` in text mode and specify the same encoding as in `encode()`? (I know, binary in `str` is bs, but I didn't write the library) – Xtrem532 Oct 13 '20 at 13:06
  • `encode` will return a binary string, so it is good to use `b`. But if you provide text to a `b`, the system may confuse things. In any case: * In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.* (https://docs.python.org/3/library/functions.html#open) `encoding=` is not supported for binary file (in `open`) – Giacomo Catenazzi Oct 13 '20 at 13:22

Aaaand found it: Path.write_bytes() will save the bytes object as is, and str.encode() doesn't touch the line endings.

Path.write_text() will encode the str just like str.encode() does, BUT: because the file is opened in text mode, the line endings are translated on write - in my case converting \n to \r\n because I'm on Windows. That adds an extra byte for every \n in the buffer, including the ones inside binary data, and a PDF has to be written byte for byte as generated, no matter what platform you're on.
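A minimal sketch of the effect, independent of PyFPDF. Here `sample` is a made-up stand-in for self.buffer, and `newline="\r\n"` is passed explicitly only to reproduce the Windows default on any platform. The last variant uses `newline=""`, which is the documented way to turn off newline translation in text mode.

```python
import os
import tempfile

# Hypothetical stand-in for FPDF.buffer: a str that really holds binary data.
sample = "%PDF-1.3\nstream\n\x00\x0a\xff\nendstream\n"

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text_mode.pdf")
    bytes_path = os.path.join(tmp, "binary_mode.pdf")
    fixed_path = os.path.join(tmp, "text_mode_no_translation.pdf")

    # Text mode with Windows-style translation: every "\n" becomes "\r\n".
    with open(text_path, "w", encoding="latin1", newline="\r\n") as f:
        f.write(sample)

    # Binary mode: the encoded bytes are written unchanged.
    with open(bytes_path, "wb") as f:
        f.write(sample.encode("latin1"))

    # Text mode with newline="": no translation takes place.
    with open(fixed_path, "w", encoding="latin1", newline="") as f:
        f.write(sample)

    # Read everything back as raw bytes for comparison.
    with open(text_path, "rb") as f:
        translated = f.read()
    with open(bytes_path, "rb") as f:
        raw = f.read()
    with open(fixed_path, "rb") as f:
        untranslated = f.read()

print(len(raw), len(translated))
print(raw == untranslated)
```

The translated file is longer by one byte per "\n", which is enough to break a binary format. Path.write_text() in Python 3.8 has no newline parameter, so on that version write_bytes() with an explicit encode() is the simpler route.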

Xtrem532