Java - extract text from pdf from selected area to txt

Question

The idea is as follows.

The user selects a PDF file, the file is converted into an image, and that image is displayed in the application.

On the image the user selects the areas to read from the PDF file. When the selection is finished, the program reads the original PDF in the background and stores the text from those areas in a txt file.

It is important that the image generated from the PDF has the same dimensions as the PDF page itself.

The following code converts the PDF to an image. I use pdfrenderer-0.9.1.jar.

import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import javax.imageio.ImageIO;
import com.sun.pdfview.PDFFile;
import com.sun.pdfview.PDFPage;


public class Pdf2Image {

public static void main(String[] args) {

    File file = new File("E:\\invoice-template-1.pdf");
    RandomAccessFile raf;
    try {
        raf = new RandomAccessFile(file, "r");

        FileChannel channel = raf.getChannel();
        ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        PDFFile pdffile = new PDFFile(buf);
        // render every page to an image; PDFRenderer page numbers start at 1
        int num=pdffile.getNumPages();
        for(int i=1;i<=num;i++)
        {
            PDFPage page = pdffile.getPage(i);

            //get the width and height for the doc at the default zoom              
            int width=(int)page.getBBox().getWidth();
            int height=(int)page.getBBox().getHeight();             

            Rectangle rect = new Rectangle(0,0,width,height);
            int rotation=page.getRotation();
            Rectangle rect1=rect;
            if(rotation==90 || rotation==270)
                rect1=new Rectangle(0,0,rect.height,rect.width);

            //generate the image
            BufferedImage img = (BufferedImage)page.getImage(
                        rect.width, rect.height, //width & height
                        rect1, // clip rect
                        null, // null for the ImageObserver
                        true, // fill background with white
                        true  // block until drawing is done
                );

            ImageIO.write(img, "png", new File("E:/invoice-template-"+i+".png"));
        }
    } 
    catch (FileNotFoundException e1) {
        System.err.println(e1.getLocalizedMessage());
    } catch (IOException e) {
        System.err.println(e.getLocalizedMessage());
    }
}
}

The image is then displayed to the user in a JavaFX application, in an ImageView component. Can you help me get the exact mouse position when the user selects the portion of the image whose text should be read from the PDF file?
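A minimal sketch of the coordinate mapping such a selection would need. It assumes the press and release positions come from `MouseEvent.getX()` / `getY()` in handlers registered on the ImageView (so they are relative to the displayed image), and that the text extraction expects rectangles with the same top-left origin as the image. `SelectionMapper` and `toPageRegion` are hypothetical names, not part of JavaFX, PDFRenderer or PDFBox.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: turns a mouse drag on the displayed image into a page region. */
final class SelectionMapper {

    /**
     * pressX/pressY and releaseX/releaseY are the drag start and end points in
     * image-view pixels; viewWidth/viewHeight is the displayed image size;
     * pageWidth/pageHeight is the page size in PDF units.
     */
    static Rectangle2D.Double toPageRegion(double pressX, double pressY,
                                           double releaseX, double releaseY,
                                           double viewWidth, double viewHeight,
                                           double pageWidth, double pageHeight) {
        // Normalise the drag so it works in any direction (e.g. bottom-right to top-left).
        double x = Math.min(pressX, releaseX);
        double y = Math.min(pressY, releaseY);
        double w = Math.abs(releaseX - pressX);
        double h = Math.abs(releaseY - pressY);

        // Scale from displayed pixels to page units; both factors are 1.0
        // when the image is shown at exactly the page size.
        double sx = pageWidth / viewWidth;
        double sy = pageHeight / viewHeight;
        return new Rectangle2D.Double(x * sx, y * sy, w * sx, h * sy);
    }
}
```

If the image really is rendered and displayed at the same size as the page, the scaling step does nothing and only the normalisation matters; keeping it makes the mapping tolerant of a zoomed ImageView.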

With this code I read the PDF file and get the text from a given position, but I have to enter the position manually :( . I use pdfbox-1.3.1.jar. I would like the positions the user selects on the image to be kept in a list, and then read the text from the PDF file at all of those positions.

    File file = new File("E:/invoice-template-1.pdf");
    PDDocument document = PDDocument.load(file);
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);
    Rectangle rect1 = new Rectangle(38, 275, 15, 100);
    Rectangle rect2 = new Rectangle(54, 275, 40, 100); 
    stripper.addRegion("row1column1", rect1);
    stripper.addRegion("row1column2", rect2);
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    int j = 0;

    for (PDPage page : pages) {
        stripper.extractRegions(page);
        List<String> regions = stripper.getRegions();
        for (String region : regions) {
            String text = stripper.getTextForRegion(region);
            System.out.println("Region: " + region + " on Page " + j);
            System.out.println("\tText: \n" + text);
        }
        j++; // advance the page counter used in the output
    }
    document.close();

For example, in the following invoice I want to select 4 areas to export text from. As each area is selected on the image, its position and dimensions are stored in a list; the program then goes through the list and exports the text at those positions from the PDF file.
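The list of selections described above could be sketched roughly like this. `SelectionList` and the generated names `region1`, `region2`, ... are hypothetical and only illustrate the idea; the rectangles are assumed to be already expressed in page coordinates.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical container for the areas the user has selected on the image. */
final class SelectionList {
    private final List<Rectangle2D> selections = new ArrayList<>();

    /** Stores one selected area (assumed to be in page coordinates). */
    void add(Rectangle2D region) {
        selections.add(region);
    }

    /** Gives each stored area a generated name, keeping the selection order. */
    Map<String, Rectangle2D> namedRegions() {
        Map<String, Rectangle2D> named = new LinkedHashMap<>();
        for (int i = 0; i < selections.size(); i++) {
            named.put("region" + (i + 1), selections.get(i));
        }
        return named;
    }
}
```

Each entry of `namedRegions()` could then be passed to `stripper.addRegion(name, rect)` before `extractRegions(page)` is called, and the result of `getTextForRegion(name)` written to the txt file.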

If I understand you correctly, the only question is "Can you help me to get the exact position of the mouse, the mouse when the user selects a portion of the image from which you want to read the text in the pdf file?" - then I'd recommend that you modify the question as to delete all mentions of PDFBox and PDFRenderer. Btw google finds several answers to "javafx get mouse position" on stackoverflow. — Tilman Hausherr, Oct 14 '16 at 10:13
Additional hint for later: PDFRenderer hasn't been worked on for 5 years, and PDFBox (which can also render) is now at version 2.0.3 and is still being worked on. — Tilman Hausherr, Oct 14 '16 at 10:16
