I am trying to split a huge PDF file into several small PDFs using pyPdf. I tried this simplified code:

from pyPdf import PdfFileWriter, PdfFileReader 
inputpdf = PdfFileReader(file("document.pdf", "rb"))

for i in xrange(inputpdf.numPages):
  output = PdfFileWriter()
  output.addPage(inputpdf.getPage(i))
  outputStream = file("document-page%s.pdf" % i, "wb")
  output.write(outputStream)
  outputStream.close()

but I got the following error message:

Traceback (most recent call last):
File "./hltShortSummary.py", line 24, in <module>
  for i in xrange(inputpdf.numPages):
File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 342, in <lambda>
  numPages = property(lambda self: self.getNumPages(), None, None)
File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 334, in getNumPages
  self._flatten()
File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 500, in _flatten
  pages = catalog["/Pages"].getObject()
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 466, in __getitem__
  return dict.__getitem__(self, key).getObject()
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 165, in getObject
  return self.pdf.getObject(self).getObject()
File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 549, in getObject
  retval = readObject(self.stream, self)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 67, in readObject
  return DictionaryObject.readFromStream(stream, pdf)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 517, in readFromStream
  value = readObject(stream, pdf)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 58, in readObject
  return ArrayObject.readFromStream(stream, pdf)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 153, in readFromStream
  arr.append(readObject(stream, pdf))
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 87, in readObject
  return NumberObject.readFromStream(stream)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 232, in readFromStream
  return NumberObject(name)
ValueError: invalid literal for int() with base 10: ''

Any ideas?

Alejandro
  • what does `print inputpdf.numPages` give you? – Senthil Kumaran Jun 18 '11 at 04:17
  • 1
    This is old, but in case someone else runs into the same issue... I found a PDF that pyPDF2 had a hard time parsing, resulting in a similar stack trace. Bug filed here: https://github.com/mstamy2/PyPDF2/issues/521 You might want to try running the PDF through a transformation, like "save as PDF" in your favorite viewer. For me, that "cleaned up" the PDF so that it could be parsed. – coppit Oct 15 '19 at 19:13

2 Answers

I think this is a bug in pyPdf. Check out the source here. `NumberObject.readFromStream` expects an integer-like string and isn't getting one; the traceback shows it received an empty string. The PDF in question is probably malformed in some unexpected way.
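
The last frame of the traceback can be reproduced without pyPdf. This is only a minimal sketch, assuming the parser collected no digit characters before converting them; the helper names `parse_pdf_integer` and `try_parse` are hypothetical and are not part of pyPdf:

```python
def parse_pdf_integer(token):
    """Hypothetical stand-in for the integer conversion at the end of the traceback."""
    # int() rejects an empty string, which is what a parser ends up with
    # when it expects digits but the file contains something else there.
    return int(token)


def try_parse(token):
    """Return (value, error message) instead of raising, to make the failure visible."""
    try:
        return parse_pdf_integer(token), None
    except ValueError as exc:
        return None, str(exc)
```

Calling `try_parse("")` reports the same kind of `ValueError` message as the traceback, while `try_parse("42")` succeeds, which is consistent with the file containing something other than a number where one was expected.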

senderle
  • Hi senderle, I really appreciate your help. If the PDF is malformed, is there a way to tell? – Alejandro Jun 18 '11 at 04:55
  • Try to hardcode the number of pages. See if this changes anything. – Geo Jun 18 '11 at 05:36
  • Just give `for i in range(2)` and see if it works for 2 pages. – Senthil Kumaran Jun 18 '11 at 07:26
  • @Alejandro, I'm not sure how to tell whether a pdf is malformed, apart from opening it in various programs and looking at error messages. There might be ways to munge the binary data in useful ways but that's beyond my pdf knowledge. – senderle Jun 18 '11 at 23:27
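
On the question of detecting a malformed PDF: a rough first pass is to look for the structural markers every PDF is expected to carry. The sketch below is only a superficial check under that assumption; the function name `quick_pdf_sanity_check` is hypothetical, and a file that passes can still be broken deeper in its object tree, as the one in the question apparently was:

```python
import re


def quick_pdf_sanity_check(data):
    """Return a list of problems found in raw PDF bytes (empty list if none)."""
    problems = []
    # A PDF normally starts with a header such as b"%PDF-1.4".
    if not data.lstrip().startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # The file normally ends with an %%EOF marker ...
    tail = data[-1024:]
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker near end of file")
    # ... preceded by "startxref" and the byte offset of the cross-reference table.
    match = re.search(rb"startxref\s+(\d+)", tail)
    if match is None:
        problems.append("missing startxref offset")
    elif int(match.group(1)) >= len(data):
        problems.append("startxref offset points past end of file")
    return problems
```

To use it, read the file in binary mode (for example `open("document.pdf", "rb").read()`) and pass the bytes in. An empty list only means the obvious markers are present.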

Try it this way

for i in xrange(inputpdf.getNumPages()):
Senthil Kumaran
  • Based on the traceback, I don't see how that could make any difference -- `numPages` is just a property that calls `getNumPages`. – senderle Jun 18 '11 at 04:38
  • No, I got the same error :(. Could it be that the pdf file is damaged? – Alejandro Jun 18 '11 at 04:45
  • Are you sure you are specifying the correct path to the file? Stop the script with pdb.set_trace() and inspect the object from the interpreter. – Geo Jun 18 '11 at 05:34
  • Yes, after adding pdb.set_trace() it looks like the file is in the right path. :( – Alejandro Jun 18 '11 at 05:48
  • 1
    I had a problem with malformed pdfs that wouldn't open with the Python module pdfrw. I managed to fix the pdfs simply by passing them through the program pdftk. `pdftk broken_pdf.pdf output fixed_pdf.pdf`. Maybe try that. – nakedfanatic Jun 18 '11 at 08:51
  • Hey, thank you @nakedfanatic. After passing them through pdftk it seems better; I just got this new error: Traceback (most recent call last): File "./hltShortSummary.py", line 32, in <module> output.addPage(inputpdf.getPage('10')) File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 354, in getPage return self.flattenedPages[pageNumber] TypeError: list indices must be integers, not str Any ideas? – Alejandro Jun 18 '11 at 13:11
  • @nakedfanatic Thank you for telling me about the pdftk program. I think I am done with this attempt; instead I am bursting my main PDF document into single pages and then loading each file into a TeX file to compile it into a PDF again. Probably not the best way to do it, but it seems to work for me. – Alejandro Jun 18 '11 at 14:15
  • @Alejandro - Error is helpful now. Just pass getPage(10) and not getPage('10') – Senthil Kumaran Jun 18 '11 at 19:55
  • @Senthil Both give me the same error. But it seems my PDF files are damaged. After running pdftk, they work better. – Alejandro Jun 18 '11 at 23:00
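
The pdftk repair step from the comments can also be driven from Python before the file is handed to the splitter. This is a minimal sketch, assuming the `pdftk` executable is installed and on `PATH`; the function names `build_pdftk_repair_command` and `repair_pdf` are hypothetical, and only the command line quoted in the comment above is used:

```python
import subprocess


def build_pdftk_repair_command(broken_path, fixed_path):
    """Build the argument list for: pdftk broken_pdf.pdf output fixed_pdf.pdf"""
    return ["pdftk", broken_path, "output", fixed_path]


def repair_pdf(broken_path, fixed_path):
    """Run pdftk to rewrite the file; raises CalledProcessError on failure."""
    # Requires the pdftk executable on PATH (FileNotFoundError otherwise).
    subprocess.run(build_pdftk_repair_command(broken_path, fixed_path), check=True)
    return fixed_path
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Whether the rewritten file then parses cleanly still depends on how badly the original was damaged.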