Questions tagged [pdf-scraping]

The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.

144 questions
1
vote
0 answers

How to extract data from messy PDF file with no standard formatting?

I am working on this PDF file to parse the tabular data out of it. I was hoping to use tabula or PyPDF2 to extract the tables, but the data in the PDF is not stored in tables, so I chose pdfplumber to extract the text instead. Until now, I am able…
1
vote
0 answers

How to highlight text on PDF using Python?

I am trying to make a Python script that lets the user input a PDF and then the words to search for. If those words are found, they are highlighted and the result is exported under a unique file name. I have code that runs if the words aren't found, but…
1
vote
0 answers

Looping over a function to scrape PDF in R

Almost absolute beginner here. I have a function to scrape a table in PDF (I took and slightly adapted the function from here). The function is as follows. scrape_pdf <- function(tables, table_number, number_columns, column_names) { data <-…
srocco
  • 108
  • 7
1
vote
1 answer

How to webscrape PDFs that are hidden under the selection option?

I am trying to download >100 PDFs from a website using Python. However, those PDFs are hidden under the selection option. For example: Option 1 Option 2 Option 3 ... Then, if I choose Option 1, I see something like this: Option 1 Clickable Link to…
Isaac A
  • 543
  • 1
  • 6
  • 18
1
vote
3 answers

Is it possible to extract a specific table with format from a PDF?

I am trying to extract a specific table from a PDF. The PDF looks like the image below. I tried different libraries in Python. With tabula-py: from tabula import read_pdf from tabulate import tabulate df = read_pdf("./tmp/pdf/Food Calories…
coding
  • 917
  • 2
  • 12
  • 25
1
vote
2 answers

Creating columns from scraped pdf with cuts on spaces

I'm trying to create a data frame from the following PDF library(tabulizer) url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf" tab1 <- extract_tables(url) However, when I call tab1 it…
babybonobo
  • 89
  • 8
1
vote
1 answer

Build a PDF file manually from scratch and embed images

I'm trying to generate a PDF file programmatically. The entire case is: I'm receiving multi-page PDFs. Each page is an image with the contents I want. I don't want to use external libraries because I'm looking for performance \ optimization…
paboobhzx
  • 109
  • 10
1
vote
0 answers

Trying to extract data from a PDF, make sense of it, and upload it to a database

I've got many PDFs which contain data like name, address, contact info, email IDs and many more details. I am trying to write a program to convert this data into a text file, using different methods to extract the info. I used methods like…
1
vote
1 answer

Get text data from a pdf with python

I am stuck with how to deal with PDFs here. I don't know how to scrape directly from the web, and when I download them locally they are complete nonsense, not the actual text data. I have tried to download with requests but the content is then just…
derric-d
  • 75
  • 1
  • 9
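The excerpt is cut off, so this may not be the asker's exact problem, but one frequent cause of "nonsense" content is treating a PDF as text: a PDF is a binary file, and a download can also silently return an HTML error page instead. A minimal sketch using only the standard library is shown here. The names `looks_like_pdf` and `save_pdf` are hypothetical helpers, and the download step itself is left out; the bytes are assumed to come from something like `requests.get(url).content`.

```python
from pathlib import Path

# Every PDF starts with a header of the form "%PDF-1.x"
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes appear to contain a PDF header."""
    # Some files carry a few stray bytes before the header,
    # so search the first 1 KiB instead of only the first 5 bytes.
    return PDF_MAGIC in data[:1024]


def save_pdf(data: bytes, path: str) -> Path:
    """Write the raw bytes to disk unchanged, refusing non-PDF content."""
    if not looks_like_pdf(data):
        raise ValueError("content does not look like a PDF (an HTML error page?)")
    out = Path(path)
    out.write_bytes(data)  # binary write: no decoding, no newline translation
    return out
```

Once the file on disk is a valid PDF, the text still has to be extracted with a PDF library; opening the file in a text editor will show compressed streams rather than readable text.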
1
vote
1 answer

Extracting/Scraping PDF with Textract - Doesn't print text

I am trying to extract the text in some PDF files using Textract. However, when I print the text at the end of the code, it just prints out a lot of empty spaces. Can anyone point me in the direction of what is going on? (text is not = "", by the…
1
vote
1 answer

Naming multiple xlsx files with TRUE or FALSE if character string is present in a particular sheet

This code reads an xlsx file and creates individually named files based on sheet number and a value found at a particular location (in this case temp[2,1]). However, because each file and sheet is slightly different, the names are inconsistent.…
Bohnston
  • 69
  • 10
1
vote
1 answer

Extracting data from a table of pdf to a structured format

I want to scrape the PDF table data into any structured format like HTML, XML, or JSON. I am using Python. I am first converting the PDF to text using the pdftotext command-line tool, but I am not able to distinguish the data of a table in the PDF. The…
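One possible approach, assuming the text was produced with `pdftotext -layout` (which keeps columns visually aligned with runs of spaces), is to split each line on two or more consecutive spaces and then emit JSON. This is only a sketch: the sample text and the function names `layout_text_to_rows` and `rows_to_records` are made up for illustration, and real PDFs with empty cells or wrapped cells will need more careful, position-based splitting.

```python
import json
import re


def layout_text_to_rows(text: str) -> list[list[str]]:
    """Split layout-preserved text into rows of cells."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        # Two or more spaces are treated as a column gap;
        # a single space stays inside the cell ("Unit price").
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


def rows_to_records(rows: list[list[str]]) -> list[dict[str, str]]:
    """Use the first row as the header and zip it with each data row."""
    header, *body = rows
    # Rows whose cell count differs from the header are dropped here;
    # a real script might log them for manual review instead.
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]


# Hypothetical sample, standing in for the output of `pdftotext -layout`
SAMPLE = """\
Item        Qty   Unit price
Widget A    2     4.50
Widget B    10    1.25
"""

records = rows_to_records(layout_text_to_rows(SAMPLE))
print(json.dumps(records, indent=2))
```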
1
vote
1 answer

Is it possible to automate running PDFelement using the command line?

I am currently trying to parse some PDFs with tables into formats like CSV/Excel so that I can then programmatically process them with Python, etc. I have found that PDFelement does a good job converting PDF to Excel, but have only been doing…
1
vote
0 answers

How can I automate a daily report by scraping data from software and then have it be sent to a recipient by email every day?

I'm nearly acquainted with programming, but I'm still learning how to properly design a program. Here's what I want to do: MY SITUATION: I work at a hotel. Every day the check-in software we have automatically generates analytical reports regarding…
Mikey
  • 64
  • 6
1
vote
0 answers

Scraping PDF with empty fields in some lines

I'm trying to get the CUSIP NO. and STATUS from this PDF. I only want the lines which have the STATUS field present ("added" or "deleted"). The problem I currently have is that I don't know how to get both fields because the STATUS field is not present…
mfalcon
  • 880
  • 1
  • 11
  • 25
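A rough sketch of one way to keep only the lines that carry a status, assuming the PDF text has already been extracted line by line. The line layout (a 9-character CUSIP first, the status last) and the sample rows are assumptions made for illustration, not taken from the actual PDF, and `extract_status_rows` is a hypothetical name.

```python
import re

# Assumed layout: CUSIP at the start of the line, optional status at the end.
# The (?i:...) group makes only the status case-insensitive.
LINE_RE = re.compile(
    r"^(?P<cusip>[0-9A-Z]{9})\b.*?\b(?P<status>(?i:added|deleted))\s*$"
)


def extract_status_rows(lines):
    """Return (cusip, status) pairs for lines that end with a status."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines without a trailing status simply do not match
            rows.append((match["cusip"], match["status"].lower()))
    return rows


# Made-up sample lines; the middle one has no STATUS field
SAMPLE_LINES = [
    "12345A100  EXAMPLE ONE COM        ADDED",
    "67890B200  EXAMPLE TWO COM",
    "ABCDE1234  EXAMPLE THREE COM      DELETED",
]

print(extract_status_rows(SAMPLE_LINES))
```

Anchoring the status at the end of the line is what lets rows with an empty STATUS column fall through. If the real PDF puts the status elsewhere, or an issuer name can itself end in "ADDED", the pattern would need adjusting.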