Hi all. Maybe you can help me with my project. I'm using PDFCreator as a virtual printer to print some images to a file. The output can be a PDF or any image type, but I need to extract data from it. Can this be done? I'm using C#.
- Please specify in more detail what kind of data you want to extract, and from which file: a PDF created by PDFCreator? An image created by PDFCreator? Or something else? – Kurt Pfeifle Sep 07 '10 at 21:31
- I want to extract text from the PDF or image, or maybe from the data sent to the printer. I'm looking for a number in the text. – Guy Sep 08 '10 at 04:38
1 Answer
You cannot extract text directly from images: an image holds only pixels, so recovering text from it would require OCR.
In principle, you can extract text from PDFs.
Here are two methods using Free software commandline utilities; maybe one of them fits your needs:
- pdftotext.exe (part of Foolabs' XPDF utilities)
- gswin32c.exe (Artifex' Ghostscript)
Example command lines to extract all text from pages 3-7:
pdftotext:
pdftotext.exe ^
-f 3 ^
-l 7 ^
-eol dos ^
-layout ^
"d:\path with spaces\to\input.pdf" ^
"d:\path\to\output.txt"
You want to get the text output to stdout instead of a file? OK, try this:
pdftotext.exe ^
-f 3 ^
-l 7 ^
-eol dos ^
-layout ^
"d:\path with spaces\to\input.pdf" ^
-
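To use the stdout variant from a program rather than from a cmd.exe window, the command can be launched as a child process and its output captured. The following is a minimal sketch in Python, not a definitive implementation; the function names `build_pdftotext_args` and `extract_text` are hypothetical, and it assumes pdftotext.exe is on the PATH. It passes `-enc UTF-8` so the captured bytes can be decoded predictably. The same approach should carry over to C# with `System.Diagnostics.Process` and a redirected standard output.

```python
import subprocess


def build_pdftotext_args(pdf_path, first_page, last_page, exe="pdftotext.exe"):
    """Build the argument list for the stdout variant of the pdftotext command."""
    return [
        exe,
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-enc", "UTF-8",         # assumption: ask for UTF-8 so decoding is predictable
        "-layout",               # keep the physical layout of the page
        pdf_path,
        "-",                     # "-" as the output file sends the text to stdout
    ]


def extract_text(pdf_path, first_page, last_page):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_args(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,              # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as the "d:\path with spaces\to\input.pdf" example.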
Ghostscript:
(Check that your installation has ps2ascii.ps in its lib subdirectory.)
gswin32c.exe ^
-q ^
-sFONTPATH=c:/windows/fonts ^
-dNODISPLAY ^
-dSAFER ^
-dDELAYBIND ^
-dWRITESYSTEMDICT ^
-dSIMPLE ^
-f ps2ascii.ps ^
-dFirstPage=3 ^
-dLastPage=7 ^
"c:/path/to/input.pdf" ^
-dQUIET
Text output will appear on stdout. If you test this in a cmd.exe window, you can redirect this to a file by appending > /path/to/output.txt to the command.
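Once the text is available, whether captured from stdout or read back from the output file, finding a number in it is a pattern-matching task. Here is a small sketch in Python under stated assumptions: the helper names and the "Invoice No" label are hypothetical, since the question does not say what surrounds the number. In C#, `System.Text.RegularExpressions.Regex` should support the same patterns.

```python
import re


def find_numbers(text):
    """Return every run of digits in the text, in order of appearance."""
    return re.findall(r"\d+", text)


def find_number_after(text, label):
    """Return the first run of digits that follows a label, or None if absent."""
    # re.escape keeps punctuation in the label from being read as regex syntax.
    match = re.search(re.escape(label) + r"\D*?(\d+)", text)
    return match.group(1) if match else None
```

For example, `find_number_after(text, "Invoice No")` would pick out the digits that follow that label, while `find_numbers(text)` lists all candidates when the position of the number is not known in advance.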

Kurt Pfeifle