
I want to ask one thing about PDFs.

I want to extract some data from a PDF, but only specific data. Is it possible to choose what gets extracted from a PDF?

For example, this image shows which data I want to extract from the PDF: http://shrani.si/f/1k/AA/Ph2cBYG/informativna-ponudba-gre.png

Thanks.


1 Answer


This question touches on two major processes: OCR and data capture (or parsing).

OCR stands for Optical Character Recognition. This process converts images to text. You will have to use this category of software if your PDFs are image-only PDFs (no text layer, such as scanned, faxed, or rasterized documents). If your PDF already contains electronic text data, you may be able to skip this step.

Data capture stands for intelligent data location and extraction, such as finding specific fields among all the other text. There are specialized software packages and/or parsing processes for that (see my previous post here).
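As a rough illustration of data capture on text that has already been extracted, here is a minimal sketch using only Python's standard `re` module. The sample text, the field labels, and the `capture_fields` helper are all assumptions made up for this example; your documents will need their own labels and patterns.

```python
import re

# Hypothetical text, standing in for what a PDF text extractor might return.
SAMPLE_TEXT = """\
Offer number: 2013-0457
Date: 07.05.2013
Customer: Example d.o.o.
Total: 1.234,50 EUR
"""

# Hypothetical field labels; replace with the labels printed in your documents.
FIELD_PATTERNS = {
    "offer_number": re.compile(r"^Offer number:\s*(\S+)", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(\d{2}\.\d{2}\.\d{4})", re.MULTILINE),
    "total": re.compile(r"^Total:\s*([\d.,]+)", re.MULTILINE),
}


def capture_fields(text):
    """Return only the wanted fields; a field that is not found maps to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


fields = capture_fields(SAMPLE_TEXT)
print(fields)
```

Everything not named in `FIELD_PATTERNS` (the customer line here) is simply ignored, which is the "only specific data" part of the question.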

If all your docs have the text you need in the same area of the page, you can crop the images and pass only those smaller zones to OCR. This in turn simplifies your text extraction logic, because there is less text to deal with.
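The same zoning idea can be applied to PDFs with a text layer, assuming your extraction tool reports a position for each word. The sketch below is not tied to any particular library: the `Word` and `Zone` types, the coordinates, and the sample words are all hypothetical, and it filters word positions rather than cropping an image.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word
    y: float  # top edge of the word, measured from the top of the page


class Zone(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def words_in_zone(words, zone):
    """Keep only words whose top-left corner lies inside the zone."""
    return [
        w for w in words
        if zone.left <= w.x <= zone.right and zone.top <= w.y <= zone.bottom
    ]


def zone_text(words, zone):
    """Join the zone's words in reading order (top to bottom, left to right)."""
    selected = sorted(words_in_zone(words, zone), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in selected)


# Made-up word positions for illustration only.
page_words = [
    Word("Company", 40, 30), Word("header", 110, 30),
    Word("Price", 300, 400), Word("125,00", 360, 400),
    Word("Footer", 40, 780),
]
price_zone = Zone(left=280, top=380, right=500, bottom=420)
print(zone_text(page_words, price_zone))
```

Because the header and footer fall outside `price_zone`, the parsing step afterwards only has to deal with the handful of words that matter.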

ilya

Ilya Evdokimov
  • Hello, thanks for your answer. My PDFs are computer-generated, so they are not scanned. The PDFs always have the same layout, but sometimes there are just 2-3 numbers and sometimes 6-7 rows of numbers, so I don't know how to capture just these numbers no matter how many rows there are. – user2352034 May 07 '13 at 20:24
  • Computer-generated PDFs may also be image-only or have a text layer, depending on the generator. Try opening one in Acrobat Reader and selecting or searching for some value. If you can find or select it, then you have a text layer and may be able to skip the OCR part. PDFs are not friendly for text parsing at all, because they give you no formatting information. If your documents are consistent, it may be possible to write simple parsing logic that looks for data types in certain predictable places. Sometimes I go with OCR + data capture even for text-based PDFs, because image objects are easier to work with. – Ilya Evdokimov May 07 '13 at 21:57
  • If you'd like, send me a couple of different variations, and I'll test my tools on them. ilya @ wisetrend.com – Ilya Evdokimov May 07 '13 at 21:58
  • I sent examples of my PDFs to your email. Thanks for your help. – user2352034 May 08 '13 at 09:25
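
For the case raised in the comments, where the number of rows varies from document to document, one possible parsing approach is to collect every line that looks like a row between two fixed marker lines. This is only a sketch under assumptions: the marker words `Item` and `Total`, the row layout (description, quantity, decimal-comma price), and the sample text are invented for illustration and are not taken from the actual PDFs.

```python
import re

# Assumed row layout: a description, then a quantity, then a decimal-comma price.
ROW = re.compile(r"^(?P<name>.+?)\s+(?P<qty>\d+)\s+(?P<price>\d+,\d{2})$")


def extract_rows(text, start_marker, end_marker):
    """Collect every line matching ROW between the two marker lines."""
    rows = []
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(start_marker):
            inside = True  # the table starts on the next line
            continue
        if line.startswith(end_marker):
            break  # stop at the first line after the table
        if inside:
            match = ROW.match(line)
            if match:
                rows.append((match["name"], int(match["qty"]), match["price"]))
    return rows


# Hypothetical extracted text; the number of item rows may differ per document.
sample = """\
Offer 2013-0457
Item Qty Price
Window frame 2 125,00
Glass panel 4 80,50
Mounting kit 1 19,90
Total 556,90
"""
rows = extract_rows(sample, "Item", "Total")
for row in rows:
    print(row)
```

The loop does not care whether there are two rows or seven; it relies only on the marker lines and the row pattern staying consistent across documents.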