Questions tagged [pdf-scraping]

The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.

144 questions
0
votes
1 answer

How to get chars/words/lines/blocks coordinates

I'm doing pdftotext -bbox file.pdf and that produces word-level output. Is there a way to output coordinates on the character/phrase/line/block level? I'm interested in knowing if either the poppler or xpdf version of pdftotext can do this.
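One possible workaround, if only word-level boxes are available, is to post-process them into line-level boxes. The sketch below is a minimal illustration under stated assumptions: the sample is a simplified, hand-written stand-in for word-level bounding-box output (real output is XHTML and carries a namespace that would need handling), and `SAMPLE`, `group_words_into_lines`, and `y_tolerance` are hypothetical names, not part of poppler or xpdf.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample of word-level bounding boxes.
SAMPLE = """<page width="612" height="792">
  <word xMin="72.0" yMin="100.0" xMax="100.0" yMax="112.0">Hello</word>
  <word xMin="104.0" yMin="100.5" xMax="140.0" yMax="112.5">world</word>
  <word xMin="72.0" yMin="120.0" xMax="110.0" yMax="132.0">Second</word>
  <word xMin="114.0" yMin="120.0" xMax="140.0" yMax="132.0">line</word>
</page>"""


def group_words_into_lines(xml_text, y_tolerance=2.0):
    """Merge word boxes whose top edges are within y_tolerance into lines."""
    root = ET.fromstring(xml_text)
    words = [
        {
            "text": w.text or "",
            "xMin": float(w.get("xMin")),
            "yMin": float(w.get("yMin")),
            "xMax": float(w.get("xMax")),
            "yMax": float(w.get("yMax")),
        }
        for w in root.iter("word")
    ]
    # Sort top-to-bottom, then left-to-right.
    words.sort(key=lambda w: (w["yMin"], w["xMin"]))

    # A word joins the current group if its top edge is close to the
    # top edge of the first word in that group.
    groups = []
    for w in words:
        if groups and abs(w["yMin"] - groups[-1][0]["yMin"]) <= y_tolerance:
            groups[-1].append(w)
        else:
            groups.append([w])

    lines = []
    for group in groups:
        group.sort(key=lambda w: w["xMin"])
        lines.append(
            {
                "text": " ".join(w["text"] for w in group),
                # Line box is the union of its word boxes.
                "bbox": (
                    min(w["xMin"] for w in group),
                    min(w["yMin"] for w in group),
                    max(w["xMax"] for w in group),
                    max(w["yMax"] for w in group),
                ),
            }
        )
    return lines


if __name__ == "__main__":
    for line in group_words_into_lines(SAMPLE):
        print(line["bbox"], line["text"])
```

The tolerance-based grouping is a heuristic; multi-column pages or rotated text would likely need a different strategy.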
0
votes
0 answers

What information does a PDF document store with regards to bulleted lists?

I am trying to extract text out of a PDF document. I am wondering how PDF handles bulleted paragraphs. Consider this example: Does PDF retain any logical meta-information that the 2 chunks of text shown above are members of a bulleted list…
Sau001
  • 1,451
  • 1
  • 18
  • 25
0
votes
0 answers

Extracting handwritten information from a pdf copy

I am working on cataloguing a set of records. Converting the paper records to PDF and then to text is not much of a problem. The main issue that I am facing is associated with handwritten entries in forms. The PDFs are all copies of forms that were…
JWH2006
  • 239
  • 1
  • 11
0
votes
1 answer

Quasi-XML: extracting text between 2 start tags

I've scraped some data from a PDF. It has data that's almost like XML and looks something like this "(1) Data-field-1 (3) Data-field-3 (5) Data-field-5; (1) Data-field-1 (2) Data-field-2 (3) Data-field-3 (5) Data-field-5; ; (2) Data-field-2 (3)…
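A rough way to approach data shaped like this is to split on the record separator and then on the numbered tags. The sketch below assumes that `;` separates records, that each `(n)` marks the start of field `n`, and that field values never contain `;` or a parenthesized number. `parse_records` and `TAG` are hypothetical names, and the sample uses only the part of the data visible in the excerpt.

```python
import re

# Matches a numbered start tag such as "(3)" and captures the number.
TAG = re.compile(r"\((\d+)\)\s*")

SAMPLE = (
    "(1) Data-field-1 (3) Data-field-3 (5) Data-field-5; "
    "(1) Data-field-1 (2) Data-field-2 (3) Data-field-3 (5) Data-field-5; ;"
)


def parse_records(text):
    """Return one dict per record, mapping tag number to field text."""
    records = []
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty records such as "; ;"
        # With a capturing group, re.split yields:
        # [prefix, tag1, value1, tag2, value2, ...]
        parts = TAG.split(chunk)
        record = {
            int(tag): value.strip()
            for tag, value in zip(parts[1::2], parts[2::2])
        }
        records.append(record)
    return records


if __name__ == "__main__":
    for record in parse_records(SAMPLE):
        print(record)
```

Missing tags simply do not appear as keys, so `record.get(2)` can be used to treat absent fields as `None`.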
0
votes
0 answers

Deceptively easy looking PDF conversion that is causing me fits

I have had tons of success using Tabula to convert PDFs to CSV files, but this particular one is causing me all kinds of issues. The file can be found here. It seems the multiple row spans are causing Tabula headaches. I would not expect Tabula…
0
votes
0 answers

Retrieve PDF Data

I'm currently working on a Windows application that loads an electoral roll PDF file. What I'm trying to do is to get the data as per Sr. No., Epic No., Name, Father's/Husband's Name, Age, Sex, House No. and PIN code. Data is available in 3 columns…
0
votes
1 answer

PDF data extraction

Is there a way for me to take a scanned PDF image and extract data from the image by highlighting the fields that are needed? We scan thousands of PDF images of real estate deeds daily and would like to be able to automate the data entry process.…
C.Roddy
  • 1
  • 1
0
votes
1 answer

Decode a JPEG image stripped from inside a PDF file

I have code that decompresses JPEGs into bitmaps, which works fine for JPEG files. However, when I feed the code a JPEG I have stripped directly from a PDF's XObject, I get errors. Adobe Reader displays the image fine, so I don't believe it's corrupted.…
Joe
  • 1
0
votes
1 answer

Extract data from PDF document

I have a PDF document. It contains data in tabular format. I want to extract the data into a comma-delimited text file, using the comma as the column delimiter. Any suggestions?
user3079559
  • 417
  • 5
  • 16
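For tabular data like that described in the question above, one simple approach is to first extract the text with its layout preserved and then split each line into columns. The sketch below assumes the text has already been extracted and that columns are separated by two or more spaces, which will not hold for every PDF. `layout_text_to_csv` and `SAMPLE` are hypothetical names, and the sample rows are made up for illustration.

```python
import csv
import io
import re

# Made-up sample of layout-preserved text taken from a PDF table.
SAMPLE = (
    "Name      City        Amount\n"
    "Smith, J  Austin      1,200\n"
)


def layout_text_to_csv(text):
    """Split each non-empty line on runs of 2+ spaces and emit CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines between rows
        columns = re.split(r"\s{2,}", line.strip())
        # csv.writer quotes any field that itself contains a comma.
        writer.writerow(columns)
    return out.getvalue()


if __name__ == "__main__":
    print(layout_text_to_csv(SAMPLE), end="")
```

Using the `csv` module rather than joining with commas by hand keeps values such as `1,200` intact as a single quoted field.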
0
votes
0 answers

Just like scraping data off the web, either from HTML or JSON, can the same be done with PDFs using R?

I would like to import tables and table-like data in research articles (PDF files) into R. Example: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf That's the PDF taken as an example here. Simple tables to start…
user3563667
  • 293
  • 4
  • 14
0
votes
1 answer

Unable to click the anchor tags for the second time

I am writing a scraping program. The very first time, I am able to click the anchor tags, but once I loop again the same doesn't happen. I have done this in the Watin instance of IE. I doubt this is because of the back of the IE instance which I…
user2703389
  • 91
  • 1
  • 10
0
votes
2 answers

How to read a PDF file line by line and create a CSV

Here is my PDF. I found THIS and I used it to scrape my PDF. 6 BEDROOMS NameAddressUnitSizeKeyRentSq FtMove in DateNotesTenant Prop # Texan 261009 West 26th3076x3$4,6952,1368/15/14$1,000 Bonus (1) Park - It's pretty mixed up. Or is it because…
Alexxio
  • 1,091
  • 3
  • 16
  • 38
0
votes
4 answers

Optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned-graphic-only plenary debate protocol PDFs from the Federal Parliament of Germany. The problem is that most of these files have a two-column format: Sample Protocol…
Cetin Sert
  • 4,497
  • 5
  • 38
  • 76
-1
votes
1 answer

Scrape data from PDF with Python but not from a table or a normal te

Hello guys, and thank you in advance for helping me. So basically, I am trying to scrape data from a PDF. This is the PDF data: What I want to do is extract data from it like that: I tried to do it with Tabula but it gave me this: and I tried with…
tous
  • 37
  • 8
-1
votes
2 answers

Extract metadata from an online PDF using pdfminer in Python

I am interested in finding out some metadata of an online PDF using pdfminer. I am interested in extracting info such as title, author, number of lines, etc. from the PDF. I am trying to use a related solution discussed…