Questions tagged [pdf-scraping]

The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.

144 questions
0
votes
2 answers

Extract larger body of character data with stringr?

I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape case history…
0
votes
1 answer

Referencing the last page in a PDF with tabula?

I want to reference the last page from a bunch of PDF documents and parse tables from it; however, the number of pages in the documents can vary. What I do know is that the last page is the same for these documents. all_tables_stream =…
0
votes
1 answer

Scraping PDF in R with Nested Information

I am attempting to scrape a rather difficult PDF in R using both pdftools::pdf_text and tabulizer::extract_tables. However, in my situation, neither of these seems to be too helpful based on the nature of the PDF. The PDF contains "nested"…
mikeytop
0
votes
3 answers

How do I iterate through files in my directory so they can be opened/read using PyPDF2?

I am working on an invoice scraper for work, where I have successfully written all the code to scrape the fields that I need using PyPDF2. However, I am having trouble figuring out how to put this code into a for loop so I can iterate through all…
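One way to structure such a loop, as a minimal sketch rather than the asker's actual code (which the excerpt does not show): walk the directory with `pathlib`, hand each file to an extraction function, and collect the results by file name. The names `scrape_directory` and `extract_text_pypdf2` are hypothetical, the `*.pdf` pattern is an assumption, and the reader call assumes PyPDF2 2.0 or later, where the class is `PdfReader` (older releases used `PdfFileReader`).

```python
from pathlib import Path


def extract_text_pypdf2(pdf_path):
    """Return the text of every page in one PDF (assumes PyPDF2 >= 2.0)."""
    # Imported lazily so the directory loop itself has no hard dependency.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # Pages without extractable text contribute an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def scrape_directory(directory, extract=extract_text_pypdf2):
    """Apply `extract` to each *.pdf file in `directory`, keyed by file name."""
    results = {}
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        try:
            results[pdf_path.name] = extract(pdf_path)
        except Exception as exc:  # one unreadable file should not stop the batch
            print(f"Skipping {pdf_path.name}: {exc}")
            results[pdf_path.name] = None
    return results
```

The existing field-scraping code would take the place of `extract`; any function that accepts a path and returns the scraped fields can be passed in, for example `scrape_directory("invoices", extract=my_invoice_parser)` with a hypothetical `my_invoice_parser`.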
0
votes
1 answer

Scraping large and complex PDF tables

I've been trying to scrape some data off of PDFs regarding 2020 election results in California for my own morbid curiosity. I need to scrape many tables that appear across many pages. In some cases, the rows will continue onto the next page, and…
pkpto39
0
votes
1 answer

How to return all extracted text from multiple PDFs in python?

This is my code. So far, it'll print all the content of the pdfs to the pages variable. However, I cannot seem to return the same extracted text. I've been testing it by pulling information from random pdfs and placing it in the folder I'm calling.…
0
votes
0 answers

How to extract a table from any file using python?

I'm writing a Python program to extract tables from Excel sheets and PDFs. Currently, I'm using different libraries for each file type: xlrd for Excel sheets, pdfminer for PDFs. I'm wondering if there is a generic approach to extract tables from any…
Parag
0
votes
1 answer

Reading a table from a pdf file by row and not by column

I am trying to pull all of the text from a PDF file. I am using online PDFs, and they include tables. This code works; however, when it gets to a table in the PDF, the text from the table is printed by columns instead of rows, which is messing up my…
0
votes
2 answers

Regular expression to remove first occurrence of letters in a determined order

I am trying to scrape a PDF with tables using Python and the tabula package. In some cases, two columns are being extracted completely mixed up. I know that the column "Type" should only have these two values: EE-Male or EE-Female. Thus, I need to…
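If "mixed up" means that the characters of the known label are interleaved, in order, with the neighbouring column's text, one possible approach is to build a regular expression from the label and keep whatever falls between its characters. This is only a sketch: the excerpt does not show the garbled values, so the interleaving assumption, the function name `split_label`, and the sample strings in the usage note are all hypothetical.

```python
import re

KNOWN_TYPES = ("EE-Female", "EE-Male")  # the two values named in the question


def split_label(mixed, labels=KNOWN_TYPES):
    """Return (label, remainder) if a label's characters occur in order in `mixed`.

    Each label character is removed at its first in-order occurrence; the text
    left over is treated as the other column's content. Returns (None, mixed)
    when no label fits.
    """
    for label in labels:
        # Lazy groups capture the text between consecutive label characters.
        pattern = "(.*?)" + "(.*?)".join(re.escape(ch) for ch in label) + "(.*)"
        match = re.fullmatch(pattern, mixed, flags=re.DOTALL)
        if match:
            return label, "".join(match.groups())
    return None, mixed
```

Under that assumption, `split_label("EJEo-hMnale")` separates the label `EE-Male` from the leftover text `John`. The split can be wrong when the other column happens to contain the same letters, so results are worth spot-checking against the source PDF.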
0
votes
1 answer

Pandas DataFrame combine multi row spanning column

I have a complex scraped dataframe that looks like this: For context, the original data from a PDF looks like so: DataFrame info: RangeIndex: 26 entries, 0 to 25 Data columns (total 5 columns): # Column …
user1757703
0
votes
1 answer

KeyError: '/Contents'

When trying to get numbers from a pdf, using PyPDF2, I get: KeyError: '/Contents'. Here is the code: import PyPDF2 as pdf fhand = open('filepdf.pdf', 'rb') reader = pdf.PdfFileReader(fhand) if reader.isEncrypted == True: pass else: for i…
Endre
0
votes
1 answer

converting plain text to data frame using dplyr in r

I'm trying to use R to convert plain text scraped from a PDF with pdftools and tidyverse into a data frame. I'm hoping for a solution using tidyverse packages. I've used the following code to get to a list of strings with my essential…
user1988
0
votes
1 answer

How to extract data from multiple PDFs in the same directory using python-camelot?

I'm trying to extract data from multiple tables in multiple PDFs and save it in CSV format. I did my research and found that python-camelot is a good tool for extraction. I tried it and it works perfectly fine on a single PDF. However, I have over 50 PDFs…
Ahmad B
0
votes
1 answer

Extracting strings from a PDF with R

I have this PDF file from the European Parliament, which you can download here. I have downloaded it and loaded it into R. It contains lists of names of Members of the European Parliament (MEPs) after a voting session. I want to extract just bits of these lists.…
hug
0
votes
1 answer

Extraction of financial statements from pdf reports

I have been trying to pull out financial statements embedded in annual reports in PDF and export them in Excel/CSV format using Python, but I am encountering some problems: 1. A specific financial statement can be on any page in the report. If I were…
Adil Saleem