The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.
Questions tagged [pdf-scraping]
144 questions
0
votes
0 answers
Incredibly high volume async web scraping
I'm working on a project and I've discovered that the data I want is stored as auto-generated PDFs on the web (not indexed by search engines). The URLs follow a consistent pattern which basically looks something like…

as9934
- 11
- 5
0
votes
1 answer
Using Text Mining in R to find a specific set of words in a set of PDFS
I am looking at a set of 10 PDFs, and I want to write code that will tell me the number of times a couple words I've predetermined appear in the document. So far, I've been using the pdftools function and tm function to find the frequency of most…

ZoeM
- 1
0
votes
1 answer
Scrapy script that was supposed to scrape pdf, doc files is not working properly
I am trying to implement a similar script on my project following this blog post here:
https://www.imagescape.com/blog/scraping-pdf-doc-and-docx-scrapy/
The code of the spider class from the source:
import re
import textract
from itertools import…

glitchy_itchy
- 29
- 7
0
votes
0 answers
URL Regex that detects links that continue onto a second line
I am using Python to scrape PDFs for links. I have a Regex that works for the most part.
URL_REGEX = r"""
(?i)\b
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|
(?:
…

Marshal Miller
- 3
- 3
0
votes
1 answer
Extracting and Organizing Text From A PDF
I'm currently trying to scrape a bunch of information from PDF pages. I have managed to get some text extracted but haven't been able to extract everything or the format has been difficult to work with. I'm using this example to kind of extract…

Nhyi
- 373
- 1
- 12
0
votes
1 answer
pdfminer: extract only text according to font size
I only want to extract text that has font size 9.800000000000068 and 10.000000000000057 from my pdf files.
The code below returns a list of the font size of each text block and its characters for one pdf file.
Extract_Data=[]
for page_layout in…

id345678
- 97
- 1
- 3
- 21
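One way to approach the font-size question above is to compare sizes with a tolerance instead of matching long floating-point values exactly. The sketch below is only an illustration: it assumes the extraction loop has already produced `(text, size)` pairs, and the names `TARGET_SIZES`, `has_target_size`, `filter_by_font_size`, and the sample `blocks` are hypothetical, not part of pdfminer.

```python
import math

# Hypothetical target sizes: the nominal values behind
# 9.800000000000068 and 10.000000000000057.
TARGET_SIZES = (9.8, 10.0)


def has_target_size(size, targets=TARGET_SIZES, tol=1e-6):
    """Return True if `size` is within `tol` of any target size."""
    return any(math.isclose(size, t, abs_tol=tol) for t in targets)


def filter_by_font_size(blocks, targets=TARGET_SIZES):
    """Keep the text of (text, size) pairs whose size matches a target."""
    return [text for text, size in blocks if has_target_size(size, targets)]


# Assumed input: (text, font size) pairs collected elsewhere.
blocks = [
    ("Heading", 14.0),
    ("Body line", 9.800000000000068),
    ("Caption", 10.000000000000057),
    ("Footnote", 8.0),
]
print(filter_by_font_size(blocks))
```

Comparing with `math.isclose` avoids depending on the exact trailing digits, which can differ from file to file.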
0
votes
1 answer
How to scrape data from PDF into Excel
I am trying to scrape the data from PDF and get it saved into an excel file. This is the pdf I needed: https://www.medicaljournals.se/acta/content_files/files/pdf/98/219/Suppl219.pdf
However, I need to scrape not all the data but the following one…

classicandy
- 13
- 3
0
votes
3 answers
How to parse the drop-down list and get all the links for the PDF using Beautiful Soup in Python?
I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. Following is the code that I used, but it did not succeed
import requests
from bs4 import BeautifulSoup
req_ses =…

techwreck
- 53
- 1
- 12
0
votes
1 answer
Is there a way to extract images from a pdf in Python while preserving the location of the image in the pdf?
I need to extract images from a pdf without losing its location in the pdf. I need to know which page the image is on and where in the text the image is located, and then save the text and images in the pdf to a json file with the sequence of the…

Manasi Gowda
- 1
- 1
0
votes
0 answers
How to extract text from rotated PDF without saving it from web response object using PyPDF2 or any other package?
I want to extract text from this link. The PDF is rotated, and I get a blank response or an empty string both when I rotate it before extracting and when I simply extract the text directly. Please…

techwreck
- 53
- 1
- 12
0
votes
0 answers
Tabula-py: reading tables from a pdf that contains form fields
I'm trying to read a PDF that contains multiple tables with form fields for ticks/checkmarks, free text, numbers, dropdown selections, etc.
Unfortunately the dataframes that are returned don't render the information contained in the pdf…

gokepler
- 1
- 1
0
votes
1 answer
trying to scrape from long PDF with different table formats
I am trying to scrape from a 276-page PDF available here: https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf
Not only is the document very long but it also has tables in different formats. I tried using the…

Jennifer B.
- 163
- 1
- 4
- 10
0
votes
0 answers
How to read a Persian PDF and scrape its contents?
I am trying to read this Persian PDF, but the result is not decoded well. I also tried UTF-16 and UTF-32, but no readable results were produced. I want to get the Persian dates inside the table.
Other libraries were tried but no good text was…

yasharov
- 113
- 10
0
votes
1 answer
Python PDF Scraping
Task:
The PDF is a bank statement containing the columns Date, Description, Deposits, Withdrawals, and Balance. I need to parse the columns with their respective fields and export that data in CSV format.
My code:
import pdftotext
import re
import csv
# open…

Abbas
- 59
- 7
0
votes
1 answer
file handling + word scraping (trying to find all the words in a file that end with 'y')
ERROR: Traceback (most recent call last): File "c:\Users\Pranjal\Desktop\tstp\zen_scraper.py", line 5, in words = re.findall("$y",file) File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2288.0_x64__qbz5n2kfra8p0\lib\re.py",…
user14143568
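
A note on the pattern in the traceback above: in a regular expression `$` anchors the end of the string, so `"$y"` asks for a `y` after the end and cannot match a word ending in `y`. Also, `re.findall` expects a string, so if `file` is a file object rather than its contents, that call would raise a `TypeError`. The traceback is truncated, so this is a guess at the cause rather than a diagnosis. A minimal sketch of one possible approach, with the hypothetical helper `words_ending_in_y` and made-up sample text:

```python
import re


def words_ending_in_y(text):
    """Return every word in `text` whose last character is 'y' or 'Y'."""
    # \b is a word boundary; \w* matches the characters before the final y.
    return re.findall(r"\b\w*y\b", text, flags=re.IGNORECASE)


# Made-up sample text standing in for the file contents.
sample = "A sunny day is very happy, yet the yard stays dry."
print(words_ending_in_y(sample))
```

To run it on a file, the text would be read first, for example with `open(path).read()`, where `path` stands for whatever file is being scanned.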