How to get positional data from PDF to text

Question

I need to convert PDF files to text using Perl so that I can extract information from them. However, the text file I get does not preserve the layout: I need each element to appear in the same position in the text as it does in the PDF. I tried CAM::PDF::PageText, but its output looks very different from the original page.

I have come across posts referring to pdftotext and Poppler, but I have not been able to set up either of them on my Windows 10 64-bit system.

Please let me know if there are any other ways to solve this problem.

http://stackoverflow.com/questions/6104045/installing-poppler-on-cygwin — xxfelixxx, Sep 29 '16 at 09:36
Have you considered copy-pasting the text from Acrobat into a text editor? — Borodin, Sep 29 '16 at 10:12
I can copy and paste the data, but the pasted text does not keep the PDF's layout, and I need that layout to extract information from the text file later. — Mohit, Sep 29 '16 at 12:58

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

What you really want is pdftohtml with the -xml output. You can build it on Windows.

There are 2 ways to compile poppler on Windows:

using mingw compiler under cygwin

using native Visual Studio (msvc) makefile

This document describes the second method. ...

You can download Visual Studio Community Edition subject to license terms to get the 2013 and 2015 versions of compilers and build tools along with the IDE.

Or, you can just get the Visual C++ build tools. See also Walkthrough: Compiling a Native C++ Program on the Command Line.
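Once pdftohtml is available, its -xml output can be read back with position data attached to each piece of text. The following is only a minimal sketch, written in Python for brevity (the same parsing can be done from Perl with an XML module). It assumes the output consists of page elements containing text elements with top and left attributes, which is how Poppler's pdftohtml writes it as far as I know; check this against a real file. The sample XML and the function name text_positions are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample imitating `pdftohtml -xml` output; real files also
# carry an XML declaration, a DOCTYPE line and <fontspec> elements.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="140" left="54" width="60" height="14" font="0">Total</text>
<text top="100" left="54" width="90" height="14" font="0"><b>Invoice</b> 42</text>
<text top="140" left="400" width="40" height="14" font="0">19.99</text>
</page>
</pdf2xml>"""


def text_positions(xml_string):
    """Return (page, top, left, text) tuples sorted into reading order."""
    root = ET.fromstring(xml_string)
    rows = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() also collects text nested inside <b>, <i> or <a>.
            content = "".join(node.itertext())
            rows.append(
                (page_no, float(node.get("top")), float(node.get("left")), content)
            )
    # Order by page, then top to bottom, then left to right.
    rows.sort(key=lambda row: row[:3])
    return rows


for page_no, top, left, content in text_positions(SAMPLE_XML):
    print(page_no, top, left, content)
```

Entries that share the same top value sit on the same line of the page, so grouping by top and ordering by left is one way to rebuild rows and columns.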

score 0 · Accepted Answer · answered Nov 25 '16 at 05:52

Sorry for the delay. I finally found a solution: pdftotext from Xpdf. The easiest way to get it is to download the precompiled binaries (.exe files). The tools, such as pdftohtml and pdftotext, can then be invoked from the command line.

See this page:

http://www.foolabs.com/xpdf/download.html

The binaries are listed there under the heading "Precompiled binaries".

At the command prompt, change to the directory that contains the binary, then call the binary with the PDF file as its parameter.

Example: pdftotext File1.pdf

The above command writes File1.txt next to File1.pdf, which here is the same folder as the binary.
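To keep the text in roughly the same positions as on the PDF page, pdftotext has a -layout option that tries to maintain the original physical layout. Below is a minimal sketch of calling it from a script, in Python for illustration; a Perl script can run the same command with system(). The helper names layout_command and pdf_to_layout_text are hypothetical, and the sketch assumes pdftotext is on the PATH (otherwise pass the full path to the .exe).

```python
import subprocess
from pathlib import Path


def layout_command(pdf_path, exe="pdftotext"):
    """Build the pdftotext call that keeps the page layout."""
    pdf = Path(pdf_path)
    txt = pdf.with_suffix(".txt")  # File1.pdf -> File1.txt
    return [exe, "-layout", str(pdf), str(txt)]


def pdf_to_layout_text(pdf_path, exe="pdftotext"):
    """Run pdftotext -layout and return the path of the text file."""
    cmd = layout_command(pdf_path, exe)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


# Example call (needs pdftotext installed and a real File1.pdf):
# txt_file = pdf_to_layout_text("File1.pdf")
```

Columns in the resulting text file are padded with spaces, so it is best viewed and parsed with a fixed-width assumption.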
