I can't figure out what is wrong with this method (part of a class that scrapes a website into a PDF). It is supposed to merge the PDF files generated from the web pages, using pyPdf.

This is the method:

def mergePdf(self,mainname,inputlist=0):
    """merging the pdf pages
    getting an inputlist to merge or defaults to the class instance self.pdftomerge list"""
    from pyPdf import PdfFileWriter, PdfFileReader
    self._mergelist = inputlist or self.pdftomerge
    self.pdfoutput = PdfFileWriter()

    for name in self._mergelist:
        print "merging %s into main pdf file: %s" % (name,mainname)
        self._filestream = file(name,"rb")
        self.pdfinput = PdfFileReader(self._filestream)
        for p in self.pdfinput.pages:
            self.pdfoutput.addPage(p)
        self._filestream.close()

    self._pdfstream = file(mainname,"wb")
    self._pdfstream.open()
    self.pdfoutput.write(self._pdfstream)
    self._pdfstream.close()

I keep getting this error:

  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 264, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 324, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 345, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "c:\tmp\easy_install-iik9vj\pyPdf-1.13-py2.7-win32.egg.tmp\pyPdf\pdf.py", line 645, in getObject
    self.stream.seek(start, 0)
ValueError: I/O operation on closed file

But when I check the status of self._pdfstream, it is reported as open:

<open file 'c:\python27\learn\dive.pdf', mode 'wb' at 0x013B2020>

What am I doing wrong?

I'd be glad for any help.

mac
alonisser

1 Answer

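OK, I found your problem. You were right to call file(); don't call open() on the resulting file object at all.

The real problem is that every input file must still be open when you call self.pdfoutput.write(self._pdfstream). Adding a page with addPage() does not read the page data right away: as your traceback shows, write() ends up in getObject(), which seeks on the reader's input stream. That is the closed file the error refers to, not self._pdfstream. So remove the line self._filestream.close() from the loop, and close the input files only after the write has finished.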
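The mechanism can be reproduced without pyPdf. The sketch below is only an illustration: `LazyReader`, `get_object` and `demo` are hypothetical names standing in for a reader that, like the one in the traceback, seeks on its input stream at the moment the data is actually needed. It is written to run on Python 3 as well.

```python
import os
import tempfile


class LazyReader(object):
    """Hypothetical stand-in for a reader that fetches data on demand."""

    def __init__(self, stream):
        # Only the stream is remembered here; nothing is read yet.
        self.stream = stream

    def get_object(self, offset, size):
        # The real read happens now, so the stream must still be open.
        self.stream.seek(offset, 0)
        return self.stream.read(size)


def demo():
    # Create a small scratch file to read from.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"0123456789")

    results = {}
    stream = open(path, "rb")
    reader = LazyReader(stream)

    # Reading while the stream is open works.
    results["while_open"] = reader.get_object(2, 3)

    # Closing the stream first makes the deferred read fail.
    stream.close()
    try:
        reader.get_object(2, 3)
        results["after_close"] = "no error"
    except ValueError:
        results["after_close"] = "ValueError"

    os.remove(path)
    return results
```

Calling `demo()` shows the read succeeding while the stream is open and raising `ValueError` once it has been closed, which is the same failure mode as in the question.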

Edit: The following pyPdf script will trigger the problem. The first write succeeds because the input file is still open; the second fails because the input file is closed before the write.

from pyPdf import PdfFileReader as PfR, PdfFileWriter as PfW

input_filename = 'in.PDF' # replace with a real file
output_filename = 'out.PDF' # something that doesn't exist

infile = file(input_filename, 'rb')
reader = PfR(infile)
writer = PfW()

writer.addPage(reader.getPage(0))
outfile = file(output_filename, 'wb')
writer.write(outfile)
print "First Write Successful!"
infile.close()
outfile.close()

infile = file(input_filename, 'rb')
reader = PfR(infile)
writer = PfW()

writer.addPage(reader.getPage(0))
outfile = file(output_filename, 'wb')
infile.close() # BAD!

writer.write(outfile)
print "You'll get a ValueError before this line"
outfile.close()
agf
  • hey agf, as I wrote my problem is with self._pdfstream. I changed to open, but this doesn't help. I still get the error when i try to do write from the pypdf and when I check the object I still get - . wtf?! – alonisser Jul 21 '11 at 10:17
  • @alonisser You're right, calling `open()` was wrong! But your problem isn't with `self._pdfstream`, it's with the input streams. Editing my answer. – agf Jul 21 '11 at 10:40
  • This seems to solve the problem, thanks a lot! But now there is another problem: I get the same long traceback with a different ending: line 693, in readObjectHeader return int(idnum), int(generation) ValueError: invalid literal for int() with base 10: '' Any ideas? – alonisser Jul 21 '11 at 14:47
  • It sounds like one of your PDFs has a field that is supposed to be an integer, but isn't. Beyond that, you might have to dig in to the pyPdf source to figure out the problem. – agf Jul 21 '11 at 14:51
  • OK, I solved this. It seems the problem was with calling pyPdf to add pages to a file that already exists; changing the name of the output file to something like "output.pdf" solved it. Thanks again @agf for all the help. – alonisser Jul 24 '11 at 15:43
  • I was able to put this answer to use. I still use `open` instead of `file`, but I save all my open file objects and only close them at the very end. – Charles J. Daniels Oct 13 '15 at 08:36
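
The approach in the last comment (keep every input file open and close them all at the very end) could be sketched roughly as follows. This is not the asker's final code: `merge_pdfs`, `reader_cls` and `writer_cls` are hypothetical names, and `contextlib.ExitStack` needs Python 3.3 or later, so it assumes a newer interpreter than the Python 2.7 in the question. The reader and writer classes are passed in and are assumed to offer `.pages`, `addPage()` and `write()` as used in the question's method.

```python
from contextlib import ExitStack


def merge_pdfs(input_paths, output_path, reader_cls, writer_cls):
    """Merge the pages of all inputs, closing inputs only after the write.

    reader_cls and writer_cls are assumptions: any pair of classes with
    .pages, addPage() and write(), as in the question's code.
    """
    writer = writer_cls()
    with ExitStack() as stack:
        for path in input_paths:
            # Register each input with the stack instead of closing it here.
            stream = stack.enter_context(open(path, "rb"))
            reader = reader_cls(stream)
            for page in reader.pages:
                writer.addPage(page)
        # Every input stream is still open while the output is written.
        with open(output_path, "wb") as out:
            writer.write(out)
    # Leaving the ExitStack block closes all input streams at once.
```

With the classes from the question, a call would look like `merge_pdfs(names, "output.pdf", PdfFileReader, PdfFileWriter)`. As noted in the comments above, the output file name should not be one of the existing input files.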