1

I need to convert PDF files to text using Perl so that I can extract information from them. However, the text output does not preserve the layout: the position of each element in the text file should match its position in the PDF. I tried CAM::PDF::PageText, but its output looks very different from the original.

I have come across posts referring to pdftotext and Poppler, but I have not been able to set up either of them on my Windows 10 64-bit system.

Please let me know if there are any other ways to solve this problem.

Borodin
Mohit

2 Answers

1

What you really want is pdftohtml with the -xml output. You can build it on Windows.

There are 2 ways to compile poppler on Windows:

  • using mingw compiler under cygwin
  • using native Visual Studio (msvc) makefile

This document describes the second method. ...

You can download Visual Studio Community Edition subject to license terms to get the 2013 and 2015 versions of compilers and build tools along with the IDE.

Or, you can just get the Visual C++ build tools. See also Walkthrough: Compiling a Native C++ Program on the Command Line.
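Once pdftohtml is available, its -xml output can be post-processed to recover positions. The following is a minimal Python sketch, not a definitive implementation. It assumes the output consists of `<page>` elements containing `<text>` elements with integer `top` and `left` attributes. `SAMPLE_XML` is hand-written illustrative data, not real tool output, and `rows_by_position` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the assumed shape of `pdftohtml -xml` output.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="300" width="60" height="12" font="0">Amount</text>
    <text top="100" left="50" width="40" height="12" font="0">Item</text>
    <text top="120" left="50" width="50" height="12" font="0"><b>Widget</b></text>
    <text top="120" left="300" width="30" height="12" font="0">9.99</text>
  </page>
</pdf2xml>"""


def rows_by_position(xml_text):
    """Return, per page, a list of rows; each row lists its text left to right."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        rows = defaultdict(list)
        for el in page.iter("text"):
            top = int(el.get("top"))
            left = int(el.get("left"))
            # itertext() also collects text nested in inline tags such as <b>.
            rows[top].append((left, "".join(el.itertext())))
        # Order rows top to bottom, and cells within a row left to right.
        pages.append(
            [[text for _, text in sorted(cells)] for _, cells in sorted(rows.items())]
        )
    return pages
```

Calling `rows_by_position(SAMPLE_XML)` groups the four sample fragments into two rows on one page. Grouping by an exact `top` value is a simplification: in real documents, fragments on the same visual line may differ by a pixel or two, so a tolerance would likely be needed.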

Community
Sinan Ünür
0

Sorry for the delay, but I finally found a solution: pdftotext from Xpdf. The easiest way to get it is to download the precompiled binaries (.exe files). The various tools, such as pdftohtml and pdftotext, can then be invoked from the command line.

Look at this page

http://www.foolabs.com/xpdf/download.html

The downloads are listed there under the heading "Precompiled binaries".

At the command prompt, change to the directory containing the binary, then call the binary with the PDF file as its argument.

Example: pdftotext File1.pdf

The above command creates File1.txt alongside File1.pdf, which in this example is the folder containing the binary.
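For the positional output asked about in the question, pdftotext also accepts a -layout option, which tries to keep the original physical layout of the text. Here is a minimal Python sketch of driving the tool from a script, assuming pdftotext is on the PATH (otherwise pass the full path to the .exe as `exe`). The function names are hypothetical, and the same command line could equally be run from Perl.

```python
import subprocess
from pathlib import Path


def build_pdftotext_cmd(pdf_path, layout=True, exe="pdftotext"):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    pdf = Path(pdf_path)
    cmd = [exe]
    if layout:
        # -layout asks pdftotext to maintain the original physical layout.
        cmd.append("-layout")
    # Name the output explicitly: same path as the PDF, with a .txt suffix.
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd


def pdf_to_text(pdf_path, layout=True, exe="pdftotext"):
    """Run pdftotext and return the path of the text file it was asked to write."""
    cmd = build_pdftotext_cmd(pdf_path, layout, exe)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

With the binary installed, `pdf_to_text("File1.pdf")` would run `pdftotext -layout File1.pdf File1.txt`. Passing a list of arguments instead of a single string avoids quoting problems with paths that contain spaces.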

Mohit