When I extracted content from a 12-page PDF file using my program, which is based on pdfminer, I got a wrong result containing only 11 pages. I tested it with other files and got the right result in most cases.

By accident, I opened the file with the Preview app in OS X Yosemite (v10.10.4) and saved it without doing anything else. After that, my program produced the right result. I found that Preview had reduced the file size from 2 MB to 300 KB, but I have no idea what it had done.

I tried searching for an answer, but most topics are about using Preview's export function to compress PDF files, and it seems no one has come across the same problem with pdfminer either.

1. What does the Preview app do to a PDF file on "Save"?

2. How can I deal with the problem?

Thanks in advance!

soulcoder
  • The problem is fixed. Another program had generated the PDF file incorrectly, so that it contained redundant content. **Preview** opened it and removed the meaningless part, which is why its size changed. **pdfminer** still works correctly, but its fault tolerance is not very good. – soulcoder Aug 26 '15 at 09:49

1 Answer

PDF is a complex file format that supports many different features and many ways of expressing the same thing. Your pdfminer-based program apparently has problems with some of those features, which causes it to misinterpret certain files. Preview, on the other hand, seems to support everything correctly and was able to read the file into its internal representation. When you then re-saved the file, Preview wrote that information out in its own way. Again, having many different ways to do the same thing means different programs will do things differently.

Preview apparently has a more compatible, more streamlined way of expressing the same content, and pdfminer handles that form better.
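One way to check whether a particular file trips pdfminer up is to compare the number of pages pdfminer yields with what the raw bytes suggest. The following is only a rough sketch under several assumptions: the helper names are hypothetical, the byte scan is a crude heuristic that misses page objects stored in compressed object streams, and `/Count` can also appear outside the page tree (for example in outlines). The pdfminer import is done lazily, so the rest runs without pdfminer installed.

```python
import re

# Matches a page object entry "/Type /Page" but not the page-tree
# node "/Type /Pages" (the lookahead rejects a following letter).
PAGE_OBJECT = re.compile(rb"/Type\s*/Page(?![A-Za-z])")

# Matches non-negative "/Count N" entries, as found in page-tree nodes.
COUNT_ENTRY = re.compile(rb"/Count\s+(\d+)")


def count_page_objects(raw):
    """Heuristic: count '/Type /Page' entries visible in the raw bytes."""
    return len(PAGE_OBJECT.findall(raw))


def declared_counts(raw):
    """Heuristic: return every non-negative '/Count N' value found."""
    return [int(value) for value in COUNT_ENTRY.findall(raw)]


def pdfminer_page_count(path):
    """Count the pages pdfminer actually yields for the file at `path`."""
    # Imported here so the heuristics above work without pdfminer.
    from pdfminer.pdfpage import PDFPage

    with open(path, "rb") as fp:
        return sum(1 for _ in PDFPage.get_pages(fp))


def raw_summary(path):
    """Read the file and return (page objects seen, declared counts)."""
    with open(path, "rb") as fp:
        raw = fp.read()
    return count_page_objects(raw), declared_counts(raw)
```

If `pdfminer_page_count` disagrees with the raw summary for the original file but agrees for the copy re-saved by Preview, that points to the original file's structure rather than to the extraction code. A mismatch between page objects and declared counts can also come from leftover objects written by the generating program.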

deceze