Hi all. Maybe you can help me with my project. I'm using PDFCreator as a virtual printer to print some images to a file. The output can be a PDF or any type of image, but I need to extract data from it. Can it be done? I'm using C#.

Kurt Pfeifle
Guy
  • Please specify in more detail what kind of data you want to extract, and from which file: a PDF created by PDFCreator? An image created by PDFCreator? Or something else? – Kurt Pfeifle Sep 07 '10 at 21:31
  • I want to extract text from the PDF or image, or maybe from the data sent to the printer. I'm looking for a number in the text. – Guy Sep 08 '10 at 04:38

1 Answer

You cannot extract text directly from images; that would require OCR (optical character recognition).

In principle, you can extract text from PDFs.

Here are two methods using free-software command-line utilities; maybe one of them fits your needs:

  1. pdftotext.exe (part of Foolabs' XPDF utilities)
  2. gswin32c.exe (Artifex' Ghostscript)

Example command lines to extract all text from pages 3-7:

pdftotext:

pdftotext.exe ^
   -f 3 ^
   -l 7 ^
   -eol dos ^
   -layout ^
   "d:\path with spaces\to\input.pdf" ^
   "d:\path\to\output.txt"

You want to get the text output to stdout instead of a file? OK, try this:

pdftotext.exe ^
   -f 3 ^
   -l 7 ^
   -eol dos ^
   -layout ^
   "d:\path with spaces\to\input.pdf" ^
   -
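
Since the goal is to find a number in the text, the stdout variant can also be driven from a program instead of a cmd.exe window. Here is a minimal sketch in Python (the same idea should carry over to C# with System.Diagnostics.Process and redirected standard output). The function names, the default executable name, and the "any run of digits" pattern are assumptions for illustration, and the end-of-line option is left out because the text is read in memory:

```python
import re
import subprocess


def build_pdftotext_command(pdf_path, first_page, last_page, exe="pdftotext.exe"):
    """Build the argument list for the stdout variant shown above.

    The name and the default `exe` value are illustrative assumptions.
    """
    return [
        exe,
        "-f", str(first_page),
        "-l", str(last_page),
        "-layout",
        pdf_path,
        "-",  # write the text to stdout instead of a file
    ]


def find_numbers(text):
    """Return every run of digits in the text, in order of appearance."""
    return re.findall(r"\d+", text)


def extract_numbers(pdf_path, first_page, last_page):
    """Run pdftotext and return the digit runs found in its output."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumption: the output encoding is unknown here; digits are ASCII,
    # so undecodable bytes are simply replaced.
    text = result.stdout.decode("utf-8", errors="replace")
    return find_numbers(text)
```

Calling extract_numbers(r"d:\path with spaces\to\input.pdf", 3, 7) would return the digit runs as strings. A real project would probably tighten the pattern to the shape of the number it expects.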

Ghostscript: (Check that your installation has ps2ascii.ps in its lib subdirectory)

gswin32c.exe ^
   -q ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dSAFER ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -dSIMPLE ^
   -f ps2ascii.ps ^
   -dFirstPage=3 ^
   -dLastPage=7 ^
   "c:/path/to/input.pdf" ^
   -dQUIET 

Text output will appear on stdout. If you test this in a cmd.exe window, you can redirect this to a file by appending > /path/to/output.txt to the command.
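
The redirect to a file can also be done from a program rather than from cmd.exe. The following is only a sketch in Python, assuming gswin32c.exe is on the PATH; the function names are hypothetical, and the arguments simply mirror the Ghostscript command above:

```python
import subprocess


def build_ps2ascii_command(pdf_path, first_page, last_page, exe="gswin32c.exe"):
    """Build the same argument list as the Ghostscript command above.

    The name and the default `exe` value are illustrative assumptions.
    """
    return [
        exe,
        "-q",
        "-sFONTPATH=c:/windows/fonts",
        "-dNODISPLAY",
        "-dSAFER",
        "-dDELAYBIND",
        "-dWRITESYSTEMDICT",
        "-dSIMPLE",
        "-f", "ps2ascii.ps",
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        pdf_path,
        "-dQUIET",
    ]


def save_text(pdf_path, output_path, first_page, last_page):
    """Run Ghostscript and send its stdout to output_path, like `> output.txt`."""
    with open(output_path, "wb") as out:
        subprocess.run(
            build_ps2ascii_command(pdf_path, first_page, last_page),
            stdout=out,
            stdin=subprocess.DEVNULL,  # give Ghostscript no interactive input
            check=True,  # raise CalledProcessError if Ghostscript fails
        )
```

Passing the arguments as a list avoids the quoting that paths with spaces need on the command line.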

Kurt Pfeifle