I am trying to read and process the contents of a .pdf file. The goal is to do this without any external libraries beyond what ships with a stock PHP installation. I have tried using file_get_contents to read the file into a string variable. The problem is that when I echo this content, it is all gibberish.

This was expected, I suppose, since the character encoding differs from the formats browsers support. I tried PHP's iconv function to convert the encoding from ASCII to UTF-8, from CP1252 to UTF-8, and vice versa, but none of these produced readable output.

So the question is: what character encoding is associated with .pdf files, and how does one convert content read from such files to a browser-supported character encoding? Thanks.

pmrutu
  • First, you do realize that PDF files are binary files with their own format (like docx files), and that simply 're-encoding' will never fix that issue? You need to be able to convert and read the entire PDF (so, look into that). – Jon Dec 02 '12 at 22:45
  • Okay, I had the impression that file_get_contents would have dealt with that situation. Thanks for the insightful input, let me try to investigate. – pmrutu Dec 02 '12 at 22:49
  • Okay, so it seems that the binary information, checked with finfo_file, is in ASCII strings, so file_get_contents should be able to get the entire .pdf file as a string. This is explained in my question; my problem is just making this string legible. – pmrutu Dec 03 '12 at 00:06
  • http://en.wikipedia.org/wiki/Portable_Document_Format#Technical_foundations More specifically: http://en.wikipedia.org/wiki/Portable_Document_Format#Text – Jon Dec 03 '12 at 00:35
  • It is possible to create a PDF file whose bytes are from the range of ASCII codes only (even though this generally is not done these days anymore). This doesn't mean, though, immediate readability. Simply open a PDF in a text editor to get a first impression. – mkl Dec 03 '12 at 06:45
  • I know, it's what I am experiencing, hence the question. I need to convert those ASCII code sets to readable UTF-8 strings. – pmrutu Dec 03 '12 at 15:50
  • @pmrutu So in essence you need some PDF parsing library (with methods extracting the text from the content if possible). If you really don't want to use a third-party library, you need to implement the methods yourself. As a primer you might start by studying the specification [ISO 32000-1:2008](http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf) with a focus on the chapters 7, 8, and 9. – mkl Dec 06 '12 at 13:17
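
To make the last comment concrete, here is a rough sketch of what "implementing it yourself" might look like for the simplest case. It is not a PDF parser. It assumes the zlib extension is enabled (it usually is in default PHP builds), that content streams are either uncompressed or use FlateDecode, and that text appears as literal `(…)` strings inside `BT … ET` text objects. The function name `extractPdfText` and the tiny in-memory sample are made up for illustration. Hex strings, font-specific encodings, encryption, and object streams are all ignored, so many real-world PDFs will yield little or nothing.

```php
<?php
// Hypothetical helper (not from the thread): collects literal text strings
// from the content streams of a PDF held in memory as a byte string.
function extractPdfText(string $pdfBytes): string
{
    // Every PDF begins with a "%PDF-" header; reject anything else.
    if (strncmp($pdfBytes, '%PDF-', 5) !== 0) {
        throw new InvalidArgumentException('Input does not look like a PDF.');
    }

    $text = '';
    $offset = 0;
    // Walk over every "stream ... endstream" pair in the file.
    while (($start = strpos($pdfBytes, 'stream', $offset)) !== false) {
        $start += strlen('stream');
        $start += strspn($pdfBytes, "\r\n", $start, 2); // EOL after keyword
        $end = strpos($pdfBytes, 'endstream', $start);
        if ($end === false) {
            break;
        }
        $offset = $end + strlen('endstream');
        $raw = substr($pdfBytes, $start, $end - $start);

        // Assumption: the stream is FlateDecode (zlib) or not compressed.
        // Try with the single EOL before "endstream" removed, then as-is.
        $data = false;
        $trimmed = preg_replace('/(\r\n|\n|\r)$/D', '', $raw);
        foreach ([$trimmed, $raw] as $candidate) {
            $data = @gzuncompress($candidate);
            if ($data !== false) {
                break;
            }
        }
        if ($data === false) {
            $data = $raw; // fall back to treating the bytes as plain content
        }

        // Text-showing operators live inside BT ... ET text objects.
        if (!preg_match_all('/\bBT\b(.*?)\bET\b/s', $data, $blocks)) {
            continue;
        }
        foreach ($blocks[1] as $block) {
            // Literal strings: "(...)" with backslash escapes, no nesting.
            preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)/s', $block, $strings);
            foreach ($strings[1] as $s) {
                // stripcslashes approximates PDF's escape rules (\n, \(, \ddd).
                $text .= stripcslashes($s);
            }
            $text .= "\n";
        }
    }
    return $text;
}

// Tiny made-up sample: one compressed content stream with one string.
$content = "BT /F1 12 Tf (Hello \\(PDF\\)) Tj ET";
$sample  = "%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
         . gzcompress($content) . "\nendstream\nendobj\n";
echo extractPdfText($sample); // prints: Hello (PDF)
```

With a real file the call would be along the lines of `extractPdfText(file_get_contents($path))`. The bytes that come back are in whatever encoding the page's font declares, so `iconv('CP1252', 'UTF-8', …)` is only plausible as a last step when the font uses WinAnsiEncoding. With subset or CID fonts the bytes are glyph codes that have to be mapped through the font's ToUnicode table, which is where chapter 9 of the specification comes in.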
