I have searched Google, Stack Overflow, and the PDF Clown and PDFBox forums for a solution without success, so I am posting the problem here.

Problem: I am trying to highlight text that spans multiple lines in a PDF document. The pages can have a one- or two-column layout.

Using PDF Clown, I was able to highlight phrases ONLY when all the words appear on the same line. With PDFBox I generated XML for individual words, but I could not find a way to handle phrases or lines.

Please suggest a solution for PDF Clown, if there is one, or any other Java-compatible tool that can highlight text spanning multiple lines in a PDF.

I found a similar question about iText, but I could not understand its answer: Multiline markup annotations with iText. Any help?

  • Is the text free-flowing like a paragraph or is it like a table? – Ross Bush Mar 27 '14 at 20:28
  • text is in paragraphs. It's not data inside tables. – user1830284 Mar 27 '14 at 20:45
  • @Irb: any possible solution that you could think of? – user1830284 Mar 31 '14 at 21:19
  • ...Not off hand. I have worked with iText and colored all cells in a table a specific color. I do not know how to do this with free flowing text, sorry. – Ross Bush Mar 31 '14 at 21:21
  • *text is in paragraphs.* do you have the coordinates of rectangles delimiting the area to be marked? – mkl Apr 01 '14 at 08:02
  • Sorry for not mentioning this earlier: the PDFs have their data organized in two columns, as in scholarly articles. PDF Clown was able to annotate text spanning multiple lines in a normal PDF, but not in a two-column PDF. – user1830284 Apr 01 '14 at 21:18
  • Using PDF Clown, I was able to highlight my required text (across multiple lines) by tweaking the regex that PDF Clown uses. But because of my tweak, it also highlights data in the other column. I allowed up to 60 characters (by adding ".{0,60}" to the regex after each token) between the words of my required text, so it also highlights those 60 characters, which belong to the unwanted column. Any suggestions on improving this so that I can remove the unwanted highlights? – user1830284 Apr 08 '14 at 22:28
  • The reference to the "Multiline markup annotations with iText" is useless here, as PDF Clown already supports text highlighting across multiple contiguous lines (see the demonstration here: http://pdfclown.org/2011/04/12/waiting-for-pdf-clown-0-1-1-release/ ) -- it can even automatically apply dehyphenation! – Stefano Chizzolini May 15 '15 at 11:56

2 Answers

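It is possible to get the coordinates of the text in a PDF document using PDFBox. Note that each TextPosition passed to the callback usually covers a single character rather than a whole word, so words have to be assembled from them. Here is the code (PDFBox 1.8.x API):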

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Prints the position and size of every piece of text found on each page.
public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        // Ask the stripper to sort text by position instead of stream order.
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    // Try the empty user password first.
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List<?> allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    // Each text fragment ends up in processTextPosition() below.
                    printer.processStream(page, page.findResources(), contents.getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    // Called once per text fragment while the page content stream is processed.
    @Override
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}
  • This is pretty useless: PDF Clown already supports text coordinate retrieval and even text aggregation: what the user was looking for is a way to deal with multi-column layout. – Stefano Chizzolini Oct 04 '14 at 06:50

Multi-column text is, at the moment (PDF Clown 0.1.2), not supported for extraction: the current algorithm gathers text lying on the same horizontal baseline without evaluating possible gaps between columns.

Automatic multi-column layout detection would be possible yet somewhat tricky, as PDF is essentially (you know) an unstructured graphic format. Nonetheless, I'm considering experimenting with it, in order to deal at least with the most common scenarios.

In the meantime, I can suggest an effective workaround (it assumes that you work on a document whose columns are placed in predictable areas): for each column, do a separate text extraction, instructing the TextExtractor to look only into the corresponding page area, then put all those partial extraction results together and apply your filter.
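
To make the workaround a bit more concrete, here is a minimal, library-independent sketch of the idea. It assumes you have already pulled the words and their bounding boxes out of the page with whatever extractor you use, and that you know the column rectangles in advance. The names ColumnAwareMatcher, Word, readingOrder and findPhrase are made up for this illustration and are not part of PDF Clown or PDFBox. The words are grouped per column, ordered top to bottom inside each column, and only then is the phrase searched, so a match can continue on the next line of the same column without picking up text from the neighbouring one.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not a PDF Clown or PDFBox class.
final class ColumnAwareMatcher {

    /** One extracted word with its bounding box (hypothetical type). */
    static final class Word {
        final String text;
        final Rectangle2D box;

        Word(String text, double x, double y, double w, double h) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, w, h);
        }
    }

    /** Orders words column by column, then top to bottom, then left to right. */
    static List<Word> readingOrder(List<Word> words, List<Rectangle2D> columns) {
        List<Word> ordered = new ArrayList<>();
        for (Rectangle2D column : columns) {
            List<Word> inColumn = new ArrayList<>();
            for (Word w : words) {
                // A word belongs to the column that contains its center point.
                if (column.contains(w.box.getCenterX(), w.box.getCenterY())) {
                    inColumn.add(w);
                }
            }
            // Assumption: words on one line share the same y; real pages
            // would need a small tolerance when comparing baselines.
            inColumn.sort(Comparator
                    .comparingDouble((Word w) -> w.box.getY())
                    .thenComparingDouble(w -> w.box.getX()));
            ordered.addAll(inColumn);
        }
        return ordered;
    }

    /** Returns the boxes of the first run of words equal to the phrase, or an empty list. */
    static List<Rectangle2D> findPhrase(List<Word> ordered, String phrase) {
        String[] tokens = phrase.trim().split("\\s+");
        for (int start = 0; start + tokens.length <= ordered.size(); start++) {
            boolean match = true;
            for (int i = 0; i < tokens.length && match; i++) {
                match = ordered.get(start + i).text.equalsIgnoreCase(tokens[i]);
            }
            if (match) {
                List<Rectangle2D> boxes = new ArrayList<>();
                for (int i = 0; i < tokens.length; i++) {
                    boxes.add(ordered.get(start + i).box);
                }
                return boxes;
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        // Two lines of text in a two-column page (made-up coordinates).
        List<Word> words = Arrays.asList(
                new Word("the", 50, 100, 40, 10), new Word("quick", 100, 100, 40, 10),
                new Word("lorem", 350, 100, 40, 10), new Word("ipsum", 400, 100, 40, 10),
                new Word("brown", 50, 120, 40, 10), new Word("fox", 100, 120, 40, 10),
                new Word("dolor", 350, 120, 40, 10));
        List<Rectangle2D> columns = Arrays.asList(
                new Rectangle2D.Double(0, 0, 300, 800),
                new Rectangle2D.Double(300, 0, 300, 800));
        for (Rectangle2D box : findPhrase(readingOrder(words, columns), "quick brown fox")) {
            System.out.println(box); // one box per word to highlight
        }
    }
}
```

Each returned rectangle would then be handed to the highlight annotation of the library you use. Because the match runs on whole words in per-column order, there should be no need for a catch-all pattern such as ".{0,60}" between the tokens.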