The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.
Questions tagged [pdf-scraping]
144 questions
0
votes
1 answer
Create single page PDF from multi page PDF WITHOUT external libraries
I've seen the following question on SO:
Create Multi-Page PDF from other PDFs
But it didn't answer what I need.
Consider that I have a PDF with 20 pages. So far so good.
From the same place, I can have a PDF with only one page. This one will be used…

paboobhzx
0
votes
2 answers
How to convert a scanned PDF file to Editable PDF file with python?
I just need to know whether a scanned PDF file can be converted to an editable PDF file using Python. I know of a couple of libraries out there, such as pytesseract and pyocr. Guidance in this regard would be highly appreciated. Thanks

Umar Aftab
0
votes
0 answers
I need to scrape data from 100 Microsoft Word documents and create a table in a CSV file
I have 100s of Microsoft Word documents. Each document has the same headers. I need to be able to read the data present in those documents and create a table. Output in the form of a CSV file.
I tried to use Scrapy. But I am new, and I don’t know…
0
votes
0 answers
Data Scraping from PDF
I am trying to collect data from a PDF using the R tabulizer package. However, I get an error when I try to convert the data to a data frame and export it to CSV. My code is below. Could someone help me with this?
# Library packages
if…

Caíque Melo
0
votes
1 answer
Extract PDF data with varying white space as separation
I'm looking at getting data from these PDFs.
I'm running into a problem, where location names with multiple words ("Northern Island" for example) are being put into different columns.
The "sep" argument within "read.table" seems to only be able to…

deetseeker
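One possible approach to the column-splitting problem described above is to split each extracted line on runs of two or more whitespace characters, so that single spaces inside a name stay intact. The following is a minimal Python sketch of that idea, not the asker's R code. It assumes columns are always separated by at least two spaces; the `split_columns` helper and the sample rows are hypothetical and not taken from the question's PDFs.

```python
import re


def split_columns(line: str) -> list[str]:
    """Split a line on runs of two or more whitespace characters.

    Assumption: real column gaps are at least two spaces wide, while
    multi-word values such as "Northern Island" contain single spaces.
    """
    return re.split(r"\s{2,}", line.strip())


# Hypothetical sample rows for illustration only.
rows = [
    "Northern Island    1200   34.5",
    "Southland   980    12.0",
]

table = [split_columns(row) for row in rows]
for record in table:
    print(record)
```

If some column gaps in the real data are only one space wide, this heuristic would not be enough, and a fixed-width or position-based split would likely be needed instead.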
0
votes
0 answers
Copy website data with SendKeys ("^a") and Paste in Excel
I have VBA code that can browse to a web page, but SendKeys "^a" and "^c" are not working. I've tried multiple times with no luck.
Please suggest.
Using this code:
Set DestinationWorksheet = ActiveSheet
Dim myURL As String
Dim HTMLdoc As HTMLDocument
…

Mukunda Adhikari
0
votes
2 answers
Is it possible for a PDF data parser to read PowerPoint PDFs?
I am currently developing a proprietary PDF parser that can read multiple types of documents with various types of data. Before starting, I was wondering whether reading PowerPoint slides is possible. My employer uses presentation guidelines that…

Mashiyath Haque
0
votes
1 answer
How to read a PDF up to a certain end line?
I am running a for loop over many research papers, and I want to extract content from each document that is read.
How can I make R read only up to the last line, where there are many dots, and treat that as the end line? Like in the picture below:
[Numbers]…

Bakai Baiazbekov
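The idea of stopping at a dotted end line can be sketched on already-extracted text. The Python example below is a rough illustration rather than the asker's R code. It assumes the end line is the first line containing a run of at least ten dots; that threshold, the `read_until_dotted_line` helper, and the sample text are all hypothetical.

```python
import re

# Assumption: an "end line" is any line with a run of at least 10 dots.
END_LINE = re.compile(r"\.{10,}")


def read_until_dotted_line(text: str) -> str:
    """Return the text that precedes the first dotted end line."""
    kept = []
    for line in text.splitlines():
        if END_LINE.search(line):
            break  # stop reading once the dotted line is reached
        kept.append(line)
    return "\n".join(kept)


# Hypothetical extracted text for illustration only.
sample = "Title\nFirst paragraph.\n..............\n[Numbers] footnote"
body = read_until_dotted_line(sample)
print(body)
```

If no dotted line is found, the whole text is returned unchanged, which may or may not be the desired fallback.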
0
votes
0 answers
pdftools - Helvetica (?) font distorts text import
I am struggling to properly read pdfs which contain the Helvetica font with the pdftools package.
I am trying to extract info from about 1,000 voting records. Overall, pdftools works as intended. However, there are one or two hundred PDFs where the…

zoowalk
0
votes
1 answer
Cleaning up text data extracted from scanned .pdf
I am creating a script to extract text from a scanned PDF and build a JSON dictionary to load into MongoDB later. The issue I have run into is that using tesseract-ocr via the Textract module successfully extracted all the text but it is…

Brett Plemons
0
votes
1 answer
No text is returned when PyPDF2 is used to scrape a one-page PDF
I have downloaded a bunch of PDFs from this source: http://ec.europa.eu/growth/tools-databases/cosing/index.cfm?fuseaction=search.detailsPDF_v2&id=28157
Now I want to scrape the PDFs using PyPDF2, but no text is returned.
I tested the code…

Mr Anderson
0
votes
1 answer
Trying to scrape a PDF in R, my code will only scrape 6 out of 9 pages and I'm not sure why. Am I missing something in my code?
I'm trying to scrape a couple of PDFs in R. PDF1 has 9 pages and PDF2 has 12 pages. When I run the code below, it scrapes both PDFs but only up to page 6 and nothing after that. Is there a reason for this? Is something missing in my code?
library(tm)
read…

Jlingz14
0
votes
1 answer
Converting a PDF file to a nice table
I have this PDF file which is arranged in 5 columns.
I have looked and looked through Stack Overflow (and Googled crazily) and tried all the solutions (including the last resort of trying Adobe Acrobat itself).
However, for some reason I cannot…

econclicks
0
votes
0 answers
Extract text from PDF section keeping strings in one line
I have a bunch of PDF files and I need to extract some information from them. The section has the heading "Referências" and looks like the picture below:
I tried a lot of text extractor tools to accomplish this task, but the problem is that I need…

Wolgan Ens
0
votes
1 answer
Is there an easy way to find specific text in a PDF, highlight it and print OR save to new file?
So what I'm hoping to do is automate the process of mapping out desk locations on a building layout map that is in PDF format.
I work with a deployment team that handles IT equipment requests, and basically we get requests with a list of user names and…

Vvega