
I'd like to reformat the main content of a PDF, so I need to extract that content: not just the text, but also tables, diagrams, etc., together with their layout information. I'm only interested in the main body. For a technical paper, for example, that means the columns of text, the tables, and the diagrams. Headers, footers, and text in the margins can be ignored.

The idea is to scan the content stream of each PDF page and classify each piece as either a text paragraph or something else. Text paragraphs would get some formatting treatment. Anything else (a table, a diagram, or anything that doesn't look like a paragraph) I'd keep as is, or just shrink or enlarge it to fit the new display.
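To make the intended treatment concrete, here is a minimal Python sketch, assuming the page has already been split into blocks. The names `Block`, `fit_scale`, and `reflow` are hypothetical, and measuring the target line length in characters is a simplification of real text layout.

```python
import textwrap
from dataclasses import dataclass

@dataclass
class Block:
    kind: str        # "paragraph" or "other" (table, diagram, ...)
    width: float     # original width in PDF points
    height: float    # original height in PDF points
    text: str = ""   # only meaningful for paragraphs

def fit_scale(width, target_width):
    """Uniform scale factor that makes a block exactly fill the target width."""
    return target_width / width

def reflow(blocks, target_width, chars_per_line):
    """Re-wrap paragraphs; scale everything else uniformly to the new width."""
    out = []
    for b in blocks:
        if b.kind == "paragraph":
            out.append(("text", textwrap.fill(b.text, width=chars_per_line)))
        else:
            s = fit_scale(b.width, target_width)
            out.append(("box", (b.width * s, b.height * s)))
    return out

blocks = [Block("paragraph", 400, 50, "aaa bbb ccc"), Block("other", 400, 200)]
print(reflow(blocks, target_width=200, chars_per_line=7))
```

Scaling width and height by the same factor keeps the aspect ratio of a table or diagram, which is what "shrink or enlarge to fit" would need.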

For example, given the following stream, I'd collect the text and note its starting point relative to the page:

stream
BT
/F1 20 Tf
120 120 Td
(Hello from Steve) Tj
ET
endstream
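For a stream as simple as this one, the extraction could be sketched in Python as below. This is only an illustration under strong assumptions: the stream is already uncompressed, only the `BT`, `ET`, `Tf`, `Td`, and `Tj` operators matter, there are no nested parentheses in strings, and the page transformation matrix and `Tm` are ignored. `TextRun` and `extract_text_runs` are hypothetical names, and real PDFs usually need font-encoding handling as well.

```python
import re
from dataclasses import dataclass

@dataclass
class TextRun:
    text: str
    x: float
    y: float
    font: str
    size: float

NUMBER = r"[-+]?\d*\.?\d+"
# A token is a literal string, a /Name, a number, or an operator keyword.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/[^\s()/]+|" + NUMBER + r"|[A-Za-z'\"*]+")

def extract_text_runs(content):
    runs, operands = [], []
    x = y = 0.0
    font, size = "", 0.0
    in_text = False
    for tok in TOKEN.findall(content):
        if tok[0] in "(/" or re.fullmatch(NUMBER, tok):
            operands.append(tok)      # operands precede their operator
            continue
        if tok == "BT":               # begin text object: reset position
            in_text, x, y = True, 0.0, 0.0
        elif tok == "ET":
            in_text = False
        elif tok == "Tf" and len(operands) >= 2:
            font, size = operands[-2].lstrip("/"), float(operands[-1])
        elif tok == "Td" and in_text and len(operands) >= 2:
            x += float(operands[-2])  # Td moves relative to the current line start
            y += float(operands[-1])
        elif tok == "Tj" and in_text and operands:
            raw = operands[-1][1:-1]  # strip the surrounding parentheses
            text = re.sub(r"\\([()\\])", r"\1", raw)  # unescape \( \) \\ only
            runs.append(TextRun(text, x, y, font, size))
        operands.clear()              # any other keyword just discards operands
    return runs

SAMPLE = """stream
BT
/F1 20 Tf
120 120 Td
(Hello from Steve) Tj
ET
endstream"""
print(extract_text_runs(SAMPLE))
```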

I'd continue decomposing the stream content this way, organizing it into an array of document elements, each with its relative position and a flag for whether it is a paragraph (so that the associated text can be reformatted).
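The kind of element array I have in mind could look like the following Python sketch. `Element` and `main_content` are hypothetical names. The fixed 54-point margin used to drop headers, footers, and margin text is an assumption, and the top-to-bottom ordering only suits a single-column page; a multi-column paper would need column detection first.

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str    # "text", "table", "figure", ...
    x0: float    # bounding box in PDF user space (origin at bottom-left)
    y0: float
    x1: float
    y1: float

def main_content(elements, page_w, page_h, margin=54.0):
    """Keep elements fully inside the page minus a fixed margin, in reading order."""
    kept = [
        e for e in elements
        if e.x0 >= margin and e.x1 <= page_w - margin
        and e.y0 >= margin and e.y1 <= page_h - margin
    ]
    # PDF y grows upward, so the top of the page has the largest y.
    return sorted(kept, key=lambda e: (-e.y1, e.x0))

page = [
    Element("text", 72, 20, 540, 35),       # footer, inside the bottom margin
    Element("figure", 72, 300, 540, 580),
    Element("text", 72, 760, 540, 775),     # header, inside the top margin
    Element("text", 72, 600, 540, 700),
]
for e in main_content(page, page_w=612, page_h=792):
    print(e.kind, e.y1)
```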

I suspect that even just decomposing a stream, telling which pieces are text paragraphs, and noting their relative positions may not be trivial.
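One naive heuristic for the paragraph part, sketched in Python: treat consecutive text lines as one paragraph while the distance between their baselines stays small relative to the font size. `Line`, `group_paragraphs`, and the 1.5 threshold are assumptions for illustration only; this would not cope with tables, multiple columns, or unusual line spacing.

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    y: float      # baseline in PDF points, larger means higher on the page
    size: float   # font size in points

def group_paragraphs(lines, max_gap_factor=1.5):
    """Join lines whose baseline distance is at most max_gap_factor * font size."""
    paragraphs, current, prev = [], [], None
    for line in sorted(lines, key=lambda l: -l.y):   # top of page first
        if prev is not None and (prev.y - line.y) > max_gap_factor * prev.size:
            paragraphs.append(" ".join(l.text for l in current))
            current = []
        current.append(line)
        prev = line
    if current:
        paragraphs.append(" ".join(l.text for l in current))
    return paragraphs

lines = [
    Line("first line", 700, 10), Line("second line", 688, 10),
    Line("third line", 676, 10),
    Line("new paragraph", 640, 10), Line("continues", 628, 10),
]
print(group_paragraphs(lines))
```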

I found that pdf.js's page.render() might help me achieve this, but I haven't figured out how to adapt it.

pdf2htmlEX might have a similar mechanism, as it can convert a PDF file to HTML.

But I'm not sure at what level these tools do the rendering/conversion. If they simply render the page as an image, they may not help for my purpose.

Adobe's PDF viewer on Android can reflow PDF content for a phone's small screen. It may use the kind of full content capture and transformation that I'd like to have.

So my question is: can anyone give me pointers on how these requirements could be achieved?

Thanks a lot

Yu Shen
  • https://github.com/dotemacs/pdfboxing maybe ? is it `JS or Clojure` or `js/clojurescript` – birdspider Aug 05 '15 at 15:47
  • pdfboxing seems to support only text extraction. I need to figure out whether it gives access to the non-text elements and keeps the sequential characteristics among all the elements, textual and non-textual. – Yu Shen Aug 05 '15 at 23:10
  • A PDF can contain text (and non-text) that has no "sequential characteristics" at all (consider a word cloud). Also, the "sequential characteristics" may be too complicated (consider a crossword puzzle). Also, the "sequential characteristics" may be obvious to a human but not to a computer (consider just about any table). Also, there is no reliable way to get any useful text at all out of any random PDF. – Jongware Aug 06 '15 at 09:28
  • @Jongware thanks for pointing out the complexity. I'm rephrasing the question. – Yu Shen Aug 07 '15 at 00:03

0 Answers