Highest Voted 'pdf-scraping' Questions

0

votes

2 answers

Extract larger body of character data with stringr?

I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape case history…

asked Mar 01 '21 at 18:10

Averysaurus

31
4

0

votes

1 answer

Referencing the last page in a PDF with tabula?

I want to reference the last page from a bunch of PDF documents and parse tables from it; however, the number of pages in the documents can vary. What I do know is that the last page is the same for these documents. all_tables_stream =…

python pandas pdf tabula pdf-scraping

asked Jan 21 '21 at 07:31

TesseractMonkey

15
3

0

votes

1 answer

Scraping PDF in R with Nested Information

I am attempting to scrape a rather difficult PDF in R using both pdftools::pdf_text and tabulizer::extract_tables. However, in my situation, neither of these seems to be too helpful based on the nature of the PDF. The PDF contains "nested"…

r pdf pdf-scraping pdftools tabulizer

asked Jan 20 '21 at 20:05

mikeytop

150
9

0

votes

3 answers

How do I iterate through files in my directory so they can be opened/read using PyPDF2?

I am working on an invoice scraper for work, where I have successfully written all the code to scrape the fields that I need using PyPDF2. However, I am having trouble figuring out how to put this code into a for loop so I can iterate through all…

python pdf pypdf pdf-scraping

asked Jan 02 '21 at 20:08

Tyler Watson

13
4
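A minimal sketch of the loop this question asks about, assuming the existing per-file scraping code can be wrapped in a function that takes a path. The names `process_pdfs` and `read_size` are hypothetical and are not taken from the question; `read_size` only stands in for the asker's PyPDF2 code so the example runs with the standard library alone.

```python
from pathlib import Path
from typing import Callable, Dict


def process_pdfs(folder: str, extract: Callable[[Path], str]) -> Dict[str, str]:
    """Apply `extract` to every .pdf file in `folder`; return {file name: result}."""
    results = {}
    # sorted() keeps the processing order stable between runs.
    # Note: the "*.pdf" pattern is case-sensitive on most non-Windows systems.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        results[pdf_path.name] = extract(pdf_path)
    return results


def read_size(pdf_path: Path) -> str:
    # Hypothetical stand-in for the per-file scraping code:
    # it opens the file in binary mode and reports its size.
    with open(pdf_path, "rb") as fh:
        return f"{len(fh.read())} bytes"
```

Calling `process_pdfs("invoices", read_size)` would return one entry per PDF in a folder named `invoices` (an assumed name); replacing `read_size` with a function that builds a PyPDF2 reader from the opened file is one way to reuse existing single-file code unchanged.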

0

votes

1 answer

Scraping large and complex PDF tables

I've been trying to scrape some data off of PDFs regarding 2020 election results in California for my own morbid curiosity. I need to scrape many tables that appear across many pages. In some cases, the rows will continue onto the next page, and…

python r excel pdf-scraping

asked Dec 06 '20 at 18:40

pkpto39

545
4
11

0

votes

1 answer

How to return all extracted text from multiple PDFs in python?

This is my code. So far, it'll print all the content of the pdfs to the pages variable. However, I cannot seem to return the same extracted text. I've been testing it by pulling information from random pdfs and placing it in the folder I'm calling.…

python pdf machine-learning nlp pdf-scraping

asked Jul 19 '20 at 15:07

Swechha Ghimire

3
2

0

votes

0 answers

How to extract a table from any file using python?

I'm writing a Python program to extract tables from Excel sheets and PDFs. Currently, I'm using a different library for each file type: xlrd for Excel sheets, pdfminer for PDFs. I'm wondering if there is a generic approach to extract tables from any…

python extract pdfminer pdf-scraping petl

asked Jul 08 '20 at 16:56

Parag

21
2

0

votes

1 answer

Reading a table from a pdf file by row and not by column

I am trying to pull all of the text from a PDF file. I am using online PDFs, and they include tables. This code works; however, when it gets to a table in the PDF, the text from the table is printed by columns instead of rows, which is messing up my…

python pdf datatables pdf-scraping

asked Jun 30 '20 at 17:03

Yodit Getahun

1
1

0

votes

2 answers

Regular expression to remove first occurrence of letters in a determined order

I am trying to scrape a PDF with tables using Python and the tabula package. In some cases, two columns are being extracted completely mixed up. I know that the column "Type" should only have these two values: EE-Male or EE-Female. Thus, I need to…

regex tabula pdf-scraping

asked Jun 26 '20 at 05:39

Belén Michel Torino

11
3

0

votes

1 answer

Pandas DataFrame combine multi row spanning column

I have a complex scraped dataframe that looks like this: For context, the original data from a PDF looks like so: DataFrame info: RangeIndex: 26 entries, 0 to 25 Data columns (total 5 columns): # Column …

python pandas dataframe web-scraping pdf-scraping

asked May 15 '20 at 17:48

user1757703

2,925
6
41
62

0

votes

1 answer

KeyError: '/Contents'

When trying to get numbers from a pdf, using PyPDF2, I get: KeyError: '/Contents'. Here is the code: import PyPDF2 as pdf fhand = open('filepdf.pdf', 'rb') reader = pdf.PdfFileReader(fhand) if reader.isEncrypted == True: pass else: for i…

python-3.x pdf pypdf pdf-scraping

asked May 10 '20 at 16:08

Endre

1
1
4

0

votes

1 answer

converting plain text to data frame using dplyr in r

I'm trying to use R to convert plain text scraped from a PDF with pdftools and tidyverse into a data frame. I'm hoping for a solution using tidyverse packages. I've used the following code to get to a list of strings with my essential…

r dplyr pdf-scraping

asked Apr 05 '20 at 17:45

user1988

13
3

0

votes

1 answer

How to extract data from multiple PDFs in the same directory using python-camelot?

I'm trying to extract data from multiple tables in multiple PDFs and save it in CSV format. I did my research and found that python-camelot is a good tool for extraction. I tried it and it works perfectly fine on a single PDF. However, I have over 50 PDFs…

python pdf-scraping python-camelot

asked Mar 11 '20 at 20:28

Ahmad B

1
1
3

0

votes

1 answer

Extracting strings from a PDF with R

I have this PDF file from the European Parliament, which you can download here. I have downloaded it and loaded it into R. It contains lists of names of Members of the European Parliament (MEPs) after a voting session. I want to extract just bits of these lists.…

r regex string pdf pdf-scraping

asked Jan 21 '20 at 10:22

hug

247
4
14

0

votes

1 answer

Extraction of financial statements from pdf reports

I have been trying to pull out financial statements embedded in annual reports in PDF and export them in Excel/CSV format using Python, but I am encountering some problems: 1. A specific financial statement can be on any page in the report. If I were…

python pdf-scraping

asked Dec 17 '19 at 21:52

Adil Saleem

3
1
2
