The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.
Questions tagged [pdf-scraping]
144 questions
-1
votes
1 answer
How can I use regex in my pdfminer code to extract text between two headings?
I have several PDFs that I want to extract data from. I have managed to use the code below to extract all the data from the PDF; however, I now want to extract the text between two different headings. I believe using regex is the best way to do this as…

Jlingz14
- 47
- 6
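For the question above, here is a minimal sketch of one way to pull out the text between two headings with a regular expression. It assumes the PDF text has already been extracted into a single string (for example with pdfminer); the function name `text_between` and the sample headings are hypothetical and only there for illustration.

```python
import re


def text_between(text, start_heading, end_heading):
    """Return the text between two headings, or None if either is missing."""
    pattern = re.compile(
        # re.escape keeps punctuation in the headings from being read as regex syntax.
        re.escape(start_heading) + r"(.*?)" + re.escape(end_heading),
        re.DOTALL,  # let "." match newlines, since a section spans several lines
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None


# Hypothetical extracted text; in practice this string would come from the PDF.
sample = "Introduction\nFirst section body.\nMethods\nSecond section body.\n"
section = text_between(sample, "Introduction", "Methods")
```

The non-greedy `(.*?)` stops at the first occurrence of the end heading. If a heading word can also appear inside the body text, the pattern would need anchoring (for example to line starts), which depends on the actual documents.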
-1
votes
1 answer
How to extract corresponding column data from a PDF
The PDF contains data separated line by line, followed by a table that contains headings with their corresponding values below them. I am unable to get it in an orderly manner; instead I get the complete column header one after the…

senor elanza
- 41
- 10
-1
votes
2 answers
How to find a specific line of text in a text file with python?
def match_text(raw_data_file, concentration):
    file = open(raw_data_file, 'r')
    lines = ""
    print("Testing")
    for num, line in enumerate(file.readlines(), 0):
        w = ' WITH A CONCENTRATION IN ' + concentration
        if…

M. Barbieri
- 512
- 2
- 13
- 27
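The excerpt above is cut off at the `if`, so the intended condition is unknown. As a sketch only, assuming the goal is to find every line that contains the phrase built from `concentration`, the search could look like this. The helper name `find_matching_lines` is hypothetical, and the plain substring test is an assumption.

```python
def find_matching_lines(lines, concentration):
    """Return (line_number, line) pairs for lines containing the target phrase."""
    # Build the phrase once, outside the loop.
    target = " WITH A CONCENTRATION IN " + concentration
    return [
        (num, line.rstrip("\n"))
        for num, line in enumerate(lines)  # numbering starts at 0, as in the excerpt
        if target in line  # assumption: a plain substring match is what is wanted
    ]


def match_text(raw_data_file, concentration):
    """Search a text file; the with-block closes the file even on errors."""
    with open(raw_data_file, "r") as file:
        return find_matching_lines(file, concentration)
```

Iterating over the file object directly avoids loading every line into memory with `readlines()`, which matters for large raw data files.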
-1
votes
1 answer
How to download linked PDF files from a website?
I want to download hundreds of PDF documents from a site. I have tried tools such as SiteSucker and similar, but they do not work, because there appears to be some "separation" between the files and the page that links to them. I don't know how to…

Magnus
- 1
- 1
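For the question above, a rough sketch of the first step, collecting the PDF links from a page's HTML, using only the Python standard library. It assumes the links are ordinary `<a href="...">` tags in the page source; if the site builds its links with JavaScript, this will not find them. The names `pdf_links` and `_LinkCollector` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pdf_links(html, base_url):
    """Return absolute URLs of links whose path ends in .pdf."""
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links against the page URL
        # Check only the path, so query strings such as ?v=2 do not hide the extension.
        if urlsplit(absolute).path.lower().endswith(".pdf"):
            links.append(absolute)
    return links
```

Each returned URL could then be fetched with `urllib.request.urlopen` or a similar HTTP client and written to disk in binary mode.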
-1
votes
2 answers
Python - How to convert many separate PDFs to text?
Question: How can I read in many PDFs in the same path using the Python package "slate"?
I have a folder with over 600 PDFs.
I know how to use the slate package to convert single PDFs to text, using this code:
migFiles = [filename for filename in…

EJS
- 1
- 1
- 2
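For the question above, a sketch of the folder loop on its own, independent of the extraction library. `convert_folder` and its `extract_text` parameter are hypothetical: `extract_text` stands for whatever single-file conversion already works (for instance a small wrapper around the slate code mentioned in the question), taking a path and returning a string.

```python
from pathlib import Path


def convert_folder(folder, extract_text):
    """Map each PDF file name in `folder` to its extracted text.

    `extract_text` is a caller-supplied function (path -> str); it is an
    assumption of this sketch, not part of any particular library.
    """
    texts = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            texts[pdf_path.name] = extract_text(pdf_path)
        except Exception as exc:
            # With hundreds of files, one unreadable PDF should not stop the batch.
            print(f"Skipping {pdf_path.name}: {exc}")
    return texts
```

Keeping the per-file converter as a parameter makes it possible to swap extraction libraries without touching the loop.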
-2
votes
1 answer
Python PDF text extraction - Unable to extract from a specific document with pdfminer/textract
I am using Python for a project that involves extracting text from many PDF documents. Interestingly, I've come across a document that cannot be parsed by either of these…

blackfireize
- 29
- 4
-3
votes
2 answers
How to separate words from an element in a list?
My list looks like the following:
['https://www.enbridge.com/Projects-and-Infrastructure/For-Shippers/Tariffs/Enbridge-Bakken-Pipeline-Company-Inc-Bakken-Canada-tariffs.aspx/~/media/Enb/Documents/Tariffs/2021/BAK CAN CER 37.pdf',…

Amelia
- 3
- 1
-3
votes
2 answers
Extraction of tables from PDF
I have a PDF file containing text, images, and tables. I want to extract just the tables from that PDF file using either Python or R.

TayyabRahmani
- 123
- 8
-4
votes
2 answers
How to transform a .pdf file into a .csv
The file is divided into continents and their countries; I want the continents to be the column headers.
I have tried many things but have been unable to do this.
Here's the link to the PDF file

rohit sharma
- 11