Questions tagged [pdf-scraping]

The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.

144 questions
-1
votes
1 answer

How can I use regex in my pdfminer code to extract text between two headings?

I have several PDFs that I want to extract data from. I have managed to use the code below to extract all the data from the PDF; however, now I want to extract text between two different headings. I believe using regex is the best way to do this as…
Jlingz14
  • 47
  • 6
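For the question above, a minimal sketch of pulling the text between two headings with a regex, assuming the PDF text has already been extracted into a single string. The function name `text_between`, the sample string, and the heading names are all hypothetical; the asker's actual pdfminer code and headings are not shown in the excerpt.

```python
import re


def text_between(text, start_heading, end_heading):
    """Return the text between two headings, or None if either is missing."""
    # re.escape treats the headings as literal text, and re.DOTALL lets
    # "." match newlines so the captured section can span several lines.
    pattern = re.compile(
        re.escape(start_heading) + r"(.*?)" + re.escape(end_heading),
        re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None


# Hypothetical stand-in for text already extracted from a PDF.
sample = "Summary\nFirst part.\nMethods\nSecond part.\nResults\nThird part."
print(text_between(sample, "Methods", "Results"))
```

The non-greedy `(.*?)` stops at the first occurrence of the end heading, which is usually the intended behaviour when a heading repeats later in the document.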
-1
votes
1 answer

How to extract corresponding column data from a PDF

The PDF contains data laid out line by line, followed by a table that contains headings with their corresponding values below them. I am unable to extract it in an orderly manner; instead, I get the complete column header one after the…
senor elanza
  • 41
  • 10
-1
votes
2 answers

How to find a specific line of text in a text file with Python?

def match_text(raw_data_file, concentration): file = open(raw_data_file, 'r') lines = "" print("Testing") for num, line in enumerate(file.readlines(), 0): w = ' WITH A CONCENTRATION IN ' + concentration if…
M. Barbieri
  • 512
  • 2
  • 13
  • 27
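For the question above, a minimal sketch of locating lines that contain a given phrase. It takes any iterable of lines, so an open file handle can be passed directly. The function name `find_lines`, the 1-based numbering, and the sample data (including the concentration unit) are assumptions for illustration; only the phrase ' WITH A CONCENTRATION IN ' comes from the excerpt.

```python
def find_lines(lines, needle):
    """Return (line_number, line) pairs for every line containing needle."""
    hits = []
    # Numbering starts at 1 here (an assumption); the excerpt starts at 0.
    for num, line in enumerate(lines, 1):
        if needle in line:
            hits.append((num, line.rstrip("\n")))
    return hits


# Hypothetical sample lines standing in for the raw data file.
sample_lines = [
    "HEADER\n",
    "SAMPLE A WITH A CONCENTRATION IN mg/L\n",
    "SAMPLE B WITH A CONCENTRATION IN ppm\n",
]
concentration = "mg/L"  # hypothetical value
print(find_lines(sample_lines, " WITH A CONCENTRATION IN " + concentration))
```

With a real file, the same call would look like `with open(raw_data_file) as handle: find_lines(handle, needle)`, which reads line by line and closes the file afterwards.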
-1
votes
1 answer

How to download linked PDF files from a website?

I want to download hundreds of PDF documents from a site. I have tried tools such as SiteSucker and similar, but they do not work, because there appears to be some "separation" between the files and the page that links to them. I don't know how to…
Magnus
  • 1
  • 1
-1
votes
2 answers

Python - How to convert many separate PDFs to text?

Question: How can I read in many PDFs in the same path using the Python package "slate"? I have a folder with over 600 PDFs. I know how to use the slate package to convert single PDFs to text, using this code: migFiles = [filename for filename in…
EJS
  • 1
  • 1
  • 2
-2
votes
1 answer

Python PDF text extraction - Unable to extract from a specific document with pdfminer/textract

I am using Python for a project that involves extracting text from many PDF documents. Interestingly, I've come across a document that cannot be parsed by either of these…
-3
votes
2 answers

How to separate words from an element in a list?

My list looks like the following: ['https://www.enbridge.com/Projects-and-Infrastructure/For-Shippers/Tariffs/Enbridge-Bakken-Pipeline-Company-Inc-Bakken-Canada-tariffs.aspx/~/media/Enb/Documents/Tariffs/2021/BAK CAN CER 37.pdf',…
Amelia
  • 3
  • 1
-3
votes
2 answers

Extraction of tables from PDF

I have a PDF file containing text, images, and tables. I want to extract just the tables from that PDF file using either Python or R.
-4
votes
2 answers

How to transform a .pdf file to a .csv

The file is divided into continents and their countries; I want the continents to be the column headers. I have tried many things but have been unable to do this. Here's the link to the PDF file