The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.
Questions tagged [pdf-scraping]
144 questions
1
vote
0 answers
How to extract data from messy PDF file with no standard formatting?
I am working on this PDF file to parse the tabular data out of it. I was hoping to use tabula or PyPDF2 to extract the tables, but the data in the PDF is not stored in tables, so I chose pdfplumber to extract the text instead. So far, I am able…

Aamir Khan Maarofi
1
vote
0 answers
How to highlight text on PDF using Python?
I am trying to make a Python script that lets the user input a PDF and then enter words to search for. If those words are found, they are highlighted and the result is exported under a unique file name. I have code that runs if the words aren't found, but…

wonderingcat99
1
vote
0 answers
Looping over a function to scrape PDF in R
Almost absolute beginner here.
I have a function to scrape a table in a PDF (I took and slightly adapted the function from here).
The function is as follows.
scrape_pdf <- function(tables, table_number, number_columns, column_names) {
data <-…

srocco
1
vote
1 answer
How to webscrape PDFs that are hidden under the selection option?
I am trying to download more than 100 PDFs from a website using Python. However, those PDFs are hidden behind a selection option. For example:
Option 1
Option 2
Option 3
...
Then, if I choose Option 1, I get something like this:
Option 1
Clickable Link to…

Isaac A
1
vote
3 answers
Is it possible to extract a specific table with format from a PDF?
I am trying to extract a specific table from a PDF. The PDF looks like the image below.
I tried different libraries in Python.
With tabula-py
from tabula import read_pdf
from tabulate import tabulate
df = read_pdf("./tmp/pdf/Food Calories…

coding
1
vote
2 answers
Creating columns from scraped pdf with cuts on spaces
I'm trying to create a data frame from the following PDF
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
However, when I call tab1 it…

babybonobo
1
vote
1 answer
Build a PDF file Manually from scratch and embed images
I'm trying to generate a PDF file programmatically.
The full case is: I'm receiving multi-page PDFs. Each page is an image with the contents I want. I don't want to use external libraries because I'm looking for performance/optimization…

paboobhzx
1
vote
0 answers
Trying to extract data from a PDF, make sense of it, and upload it to a database
I've got many PDFs which contain data like name, address, contact info, email IDs, and many more details.
I am trying to write a program to convert this data into a text file and to use different methods to extract the info.
I used methods like…

suyash joshi
1
vote
1 answer
Get text data from a pdf with python
I am stuck on how to deal with PDFs here. I don't know how to scrape directly from the web, and when I download them locally they are complete nonsense, not the actual text data.
I have tried to download with requests, but the content is then just…
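The excerpt above is cut off, so the actual cause is unknown. One common cause of "nonsense" output, assumed here, is handling the binary PDF body as text. The sketch below uses only the standard library; `looks_like_pdf` and `save_pdf` are hypothetical helper names, and the bytes would come from wherever the download happens.

```python
from pathlib import Path

# Every PDF file begins with this header marker.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def save_pdf(data: bytes, path: str) -> Path:
    """Write raw bytes unchanged; refuse data that is not a PDF."""
    if not looks_like_pdf(data):
        # Often the server returned an HTML page instead of the file.
        raise ValueError("downloaded body does not look like a PDF")
    target = Path(path)
    target.write_bytes(data)  # binary write, no text decoding
    return target
```

With requests, the bytes to pass in would be `response.content` rather than `response.text`. The saved file still needs a PDF text extractor to get readable text out of it; opening it as plain text will always look garbled.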

derric-d
1
vote
1 answer
Extracting/Scraping PDF with Textract - Doesn't print text
I am trying to extract the text in some PDF files using Textract.
However, when I print the text at the end of the code, it just prints out a lot of empty spaces.
Can anyone point me in the direction of what is going on? (text is not = "", by the…
1
vote
1 answer
Naming multiple xlsx files with TRUE or FALSE if character string is present in a particular sheet
This code reads an xlsx file and creates individually named files based on sheet number and a value found at a particular location (in this case temp[2,1]). However, because each file and sheet is slightly different, the names are inconsistent.…

Bohnston
1
vote
1 answer
Extracting data from a table of pdf to a structured format
I want to scrape the PDF table data into any structured format such as HTML, XML, or JSON.
I am using Python. I first convert the PDF to text using the pdftotext command-line tool, but I am not able to distinguish the table data in the PDF.
The…
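As a rough sketch of one way to approach the excerpt above: if the text is produced with `pdftotext -layout`, columns tend to stay separated by runs of spaces, so rows can be split on two or more spaces and written out as JSON. The sample text and the function names below are hypothetical, and the approach assumes no cell contains a double space and no cell is empty.

```python
import json
import re

# Hypothetical sample of `pdftotext -layout` output; real files will differ.
SAMPLE_TEXT = """\
Item            Qty    Unit price
Blue widget     3      4.50
Red widget      12     0.75
"""


def layout_text_to_rows(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_records(rows):
    """Treat the first row as the header and map later rows onto it."""
    header, *body = rows
    # Rows whose cell count differs from the header are skipped, not guessed.
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]


records = rows_to_records(layout_text_to_rows(SAMPLE_TEXT))
print(json.dumps(records, indent=2))
```

Splitting on whitespace runs is fragile when cells are empty or wrap across lines; in those cases a splitter based on fixed column positions is usually a better fit.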

Shivam Singh
1
vote
1 answer
Is it possible to automate running PDFelement from the command line?
I am currently trying to parse some PDFs with tables into formats like CSV/Excel so that I can then programmatically process them with Python, etc.
I have found that PDFelement does a good job converting PDF to Excel, but have only been doing…

SL12345
1
vote
0 answers
How can I automate a daily report by scraping data from software and then have it be sent to a recipient by email every day?
I'm nearly acquainted with programming, but I'm still learning how to properly design a program. Here's what I want to do:
MY SITUATION: I work at a hotel. Every day the check-in software we have automatically generates analytical reports regarding…

Mikey
1
vote
0 answers
Scraping a PDF with empty fields in some lines
I'm trying to get the CUSIP NO. and STATUS from this PDF. I only want the lines which have the STATUS field present ("added" or "deleted").
The problem I currently have is that I don't know how to get both fields because the STATUS field is not present…
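A minimal sketch of one way to keep only the lines that carry a status, assuming the PDF text has already been extracted line by line, that each CUSIP appears as nine contiguous alphanumeric characters at the start of a line, and that the status is the last token. The sample lines, the pattern, and `rows_with_status` are all hypothetical; the real layout of the file is not shown in the excerpt.

```python
import re

# Hypothetical lines standing in for text already extracted from the PDF.
SAMPLE_LINES = [
    "000000AA1  EXAMPLE CORP A COM       ADDED",
    "000000BB2  EXAMPLE CORP B COM",
    "000000CC3  EXAMPLE CORP C COM       DELETED",
]

# A nine-character CUSIP at the start, an optional middle part, and the
# status as the final token. Lines without a trailing status do not match.
ROW = re.compile(
    r"^\s*(?P<cusip>[0-9A-Za-z]{9})\s+(?:.*\s)?(?P<status>added|deleted)\s*$",
    re.IGNORECASE,
)


def rows_with_status(lines):
    """Return (cusip, status) pairs for lines that end in a status."""
    pairs = []
    for line in lines:
        match = ROW.match(line)
        if match:
            pairs.append((match.group("cusip"), match.group("status").lower()))
    return pairs


print(rows_with_status(SAMPLE_LINES))
```

A description that itself ends in the word "added" or "deleted" would be picked up as a status by this pattern, so matching on a fixed column position is safer if the layout allows it.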

mfalcon