Questions tagged [pdf-scraping]

The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.

144 questions
2
votes
2 answers

Python PdfMiner - How to get the info on the orientation of each word/sentence included in a pdf?

Target: I want to extract the orientation of each word or sentence from a PDF like the attached one. The reason is that I want to keep only the text oriented at zero degrees, not the text at 90, 180, or 270 degrees. What I…
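
One way to approach the orientation question above is to read the rotation from the text matrix of each character. The sketch below assumes the library exposes a six-number PDF matrix (a, b, c, d, e, f) per character (pdfminer.six appears to store it as the matrix attribute of LTChar objects, but verify this against your installed version). The helper names rotation_degrees and is_upright are hypothetical, not part of any library.

```python
import math


def rotation_degrees(matrix):
    """Return the rotation (0 <= angle < 360) encoded in a PDF text matrix.

    `matrix` is a 6-tuple (a, b, c, d, e, f). Only a and b are needed:
    for a pure rotation they equal cos(angle) and sin(angle).
    """
    a, b = matrix[0], matrix[1]
    return round(math.degrees(math.atan2(b, a))) % 360


def is_upright(matrix):
    """True when the text is oriented at zero degrees."""
    return rotation_degrees(matrix) == 0


# Hypothetical sample matrices for the four orientations in the question.
samples = [(1, 0, 0, 1, 0, 0), (0, 1, -1, 0, 0, 0),
           (-1, 0, 0, -1, 0, 0), (0, -1, 1, 0, 0, 0)]
upright_only = [m for m in samples if is_upright(m)]
```

Filtering characters with is_upright before joining them into words would keep only the zero-degree text, under the assumption that the matrix is available per character.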
2
votes
1 answer

Trouble using extract_tables() function in tabulizer package

I am trying to scrape tables from a PDF stored in my local directory rather than from a web browser (the PDF does not open directly in a browser). I downloaded the PDF to my local directory and am trying to read only my tables from there. When I…
GaB
2
votes
1 answer

Camelot-py not detecting two lines of text in one row

I am scraping table data from a PDF using Camelot-py, and it is not picking up stacked lines of text (refer to rows 9 and 10 below). Rows 9 and 10 are void of text for account.…
Logan McNulty
2
votes
0 answers

Python tabula raises "AttributeError: module 'tabula' has no attribute 'read_pdf'"

I am working with Tabula to do some PDF scraping. However, when I run tables = tabula.read_pdf(file, pages = "all", multiple_tables = True) I get AttributeError: module 'tabula' has no attribute 'read_pdf'. I tried most of the solutions found on the web,…
Blackchat83
2
votes
0 answers

Trouble with tabulizer library in R recognizing non-alphanumeric (symbol) characters in a table in a PDF

I am using the tabulizer library in R to capture data from a table located inside a PDF on a public website (https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf). The example table that I am…
2
votes
1 answer

R pdftools does not get all layers of pdf when converting to image

I have just started trying to use pdftools to extract images from PDFs. However, I have found that not all layers are reproduced. For example, in the code below the lines are reproduced in the PNG but not the points. Obviously in this example I could…
Sarah
2
votes
1 answer

How to extract embedded OCR data from a PDF?

I have PDF files with embedded OCR data (I have already OCRed them), so they are searchable. Now I want to extract this OCR data because I want to put it in my tomcat6 search server. For this, I need the plain OCR data. So my question is, is it…
erik
2
votes
3 answers

How to scrape a downloaded PDF file with R

I’ve recently gotten into scraping (and programming in general) for my internship, and I came across PDF scraping. Every time I try to read a scanned pdf with R, I can never get it to work. I’ve tried using the file.choose() function to no avail. Do…
2
votes
1 answer

PDF scraping using textract module

I have a Node.js app that has to do some web scraping of online PDFs. This is a piece of code: var textract = require('textract'); const util = require('util'); var methods = {}; var urls = [ {year: '2016', link: 'http://www.url2016.pdf'}, …
user6118527
2
votes
1 answer

iTextSharp PDF Reading highlighted text (highlight annotations) using C#

I am developing a C# WinForms application that converts PDF contents to text. All the required contents are extracted except the content found in highlighted text of the PDF. Please help me get a working sample to extract the highlighted text…
Binod
2
votes
2 answers

Parsing a PDF via URL with Python using pdfminer

I am trying to parse this file without downloading it from the website. I have run this with the file on my hard drive and can parse it without issue, but when I run this script it trips. if not document.is_extractable: raise…
user3271518
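
A common pattern for parsing a PDF from a URL without saving it to disk is to read the response bytes into an in-memory, seekable buffer and hand that buffer to the parser. The sketch below uses only the standard library. The function names as_pdf_stream and fetch_pdf are hypothetical, and the header check is a simple sanity test, not full validation.

```python
import io
import urllib.request


def as_pdf_stream(data):
    """Wrap raw bytes in a seekable file-like object after a sanity check."""
    if not data.startswith(b"%PDF-"):
        # Servers often return an HTML error page instead of the PDF.
        raise ValueError("data does not start with a PDF header")
    return io.BytesIO(data)


def fetch_pdf(url, timeout=30):
    """Download a PDF into memory; nothing is written to disk."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return as_pdf_stream(response.read())
```

The returned object can then be passed to any parser that accepts a binary file object. pdfminer.six's high-level extract_text function is documented as accepting one, though that is worth confirming for the version in use.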
1
vote
1 answer

Extract the text of word documents by page instead of paragraph (R)

I currently have a (large) amount of text data in (hundreds of) .pdf and .docx files. I would like to extract the text per page because page numbers become relevant later in the analysis. For the PDF files, I'm using the pdftools package, which works…
Rasul89
1
vote
2 answers

Scraping specific pdfs from different websites

First question here. I need to download a specific PDF from every URL. I need just the PDF of the European Commission proposal from each URL that I have, which is always in a specific part of the page [Here the part from the website that I would…
1
vote
0 answers

Maintaining the sequence of the extracted text and images from the PDF while scraping them in Python

I am trying to extract text and images from a PDF in Python using the PyMuPDF library. Unfortunately, I can't preserve the sequence of the images. For example, the image is placed at the start of the page, but while extracting it, the image is…
1
vote
0 answers

Scraping info out of PDFs using Python

I have PDFs distributed over several folders and sub folders. I've been trying to write a short Python script with the idea to search each PDF for any term I enter. As not all PDFs are searchable, I also tried to implement a list of searchable,…
OldGrey
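
The folder search described above can be sketched with the standard library alone, keeping text extraction as a pluggable step. Here find_pdfs and search_pdfs are hypothetical names, and extract_text is a caller-supplied function (for example a thin wrapper around whichever PDF library is in use) that returns the text of one file, or an empty string for PDFs with no text layer.

```python
from pathlib import Path


def find_pdfs(root):
    """Yield every .pdf file under `root`, including sub folders."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


def search_pdfs(root, term, extract_text):
    """Return (matches, unsearchable) for `term` under `root`.

    `matches` lists PDFs whose text contains the term (case-insensitive);
    `unsearchable` lists PDFs for which no text could be extracted.
    """
    needle = term.casefold()
    matches, unsearchable = [], []
    for path in find_pdfs(root):
        text = extract_text(path) or ""
        if not text.strip():
            unsearchable.append(path)
        elif needle in text.casefold():
            matches.append(path)
    return matches, unsearchable
```

Keeping the unsearchable files in a separate list makes it possible to route them to an OCR step later, rather than silently reporting no match.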