The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.
Questions tagged [pdf-scraping]
144 questions
4
votes
3 answers
How to scrape PDFs using Python; specific content only
I am trying to get data from PDFs available on the site
https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en
For example, if I look at November 2019…

Camilia
- 61
- 1
- 1
- 2
4
votes
1 answer
How to extract content from a PDF file in React Native
I am working on a personal project where I want to pick a PDF file from the file system and read its content by any means.
I tried every possible library out there but nothing works and most of them no support…

Hesham A. Othman
- 43
- 2
- 6
4
votes
1 answer
Headers are not extracted when extracting table data from a PDF using Camelot
I am using Camelot for table data extraction; however, the headers are not being extracted from the PDF.
The link to the target PDF is attached below, and the target tables are on pages 3 and 4, which need to…

Abhishek Bisht
- 138
- 1
- 10
4
votes
0 answers
Identifying tables with gridlines in a PDF using Python with Tabula
I'm trying to extract all the tables contained in a PDF document (about 250 pages). The problem is not extraction but identifying the tables. My algorithm also picks up junk data, such as contents and sometimes bullet points, which I…

Mehul Verma
- 123
- 1
- 8
4
votes
0 answers
Issue with downloading PDF via Puppeteer if it opens in a new tab
I am trying to download a PDF by clicking a button, but it opens in a new tab instead of downloading.
I have tried the following solutions, but nothing seems to work. Please help.
1) Listening to the…

Mallikarjun
- 158
- 9
4
votes
5 answers
What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?
Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to.
Something that works with C# or classic ASP (VBScript) would be ideal and I also need to be able to separate the pages from the PDF.
This question had…

Mark Biek
- 146,731
- 54
- 156
- 201
4
votes
1 answer
pdf2txt.py not executing command
Whenever I use pdf2txt.py on my command line the source file opens and the command does not execute. I've just installed the packages and haven't been able to get it to run. For example, I will type the command:
pdf2txt.py -c UTF-8 output.txt "my…

user3368835
- 357
- 2
- 7
- 15
3
votes
5 answers
Is there a way to remove unwanted spaces from a string using Python or some NLP technique? (NOT trailing or extra spaces)
s = "Over 20 years, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven EnergieSprong model. Capital Budget Rather than speculatively…

zackakshay
- 41
- 2
3
votes
3 answers
Title Extraction/Identification from PDFs
I have a large number of PDFs in different formats. Among other things, I need to extract their titles (not the document name, but a title in the text). Due to the range of formats, the titles are not in the same locations in the PDFs. Further, some…

Evan Mata
- 500
- 1
- 6
- 19
3
votes
1 answer
How to download a PDF from print preview using Puppeteer
In Puppeteer, I am trying to download an invoice. When I click the download button, it opens the print preview dialog. Is there a way to save the PDF from the print preview window?
The content inside the print preview is not the same as the rendered page,…

Mallikarjun
- 158
- 9
3
votes
0 answers
Issue with Puppeteer navigating to a PDF document when headless is true
I am trying to scrape a PDF file using Puppeteer.
Upon clicking the button, it navigates to the PDF file, but Puppeteer fails to render it or cannot navigate to the PDF document. The response is null.
If headless is false, then the PDF renders…

Mallikarjun
- 158
- 9
3
votes
1 answer
Why is GetTextFromPage from iTextSharp returning longer and longer strings?
I am using the latest iTextSharp library from NuGet (5.5.8) to parse some text from a PDF file. The problem I am facing is that the GetTextFromPage method does not only return the text from the page that it should, it also returns the text from the previous…

Espo
- 41,399
- 21
- 132
- 159
3
votes
2 answers
Scraping Unstructured Information from a PDF
I am looking to scrape information from this PDF into the following format:
I have circled the areas in the PDF where the information will come from.
As you can see, the formatting of this PDF is highly unstructured and to make matters worse,…

mchangun
- 9,814
- 18
- 71
- 101
2
votes
0 answers
I extracted a PDF file using Python Tika, but I want to extract header and footer details. How can I get them?
from tika import parser

file_name = "sample.pdf"
# from_file returns a dict holding the extracted text and the document metadata
parsed = parser.from_file(file_name)
print(parsed["content"])
print(parsed["metadata"])
But I want to extract the header and footer details. How can I do that using Python Tika?
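One possible direction, offered only as a rough sketch and not as something Tika provides: headers and footers usually repeat on most pages, so they can be guessed by comparing the first and last non-empty line of each page. The sketch below assumes the text has already been split into one string per page (how to get per-page text out of Tika is not shown here), and the function names, the digit masking, and the `min_fraction` threshold are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def _mask_digits(line):
    """Replace digit runs with '#' so 'Page 1 of 3' and 'Page 2 of 3' match."""
    return re.sub(r"\d+", "#", line)


def edge_lines(page_text):
    """Return the (first, last) non-empty lines of one page, digits masked."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return None, None
    return _mask_digits(lines[0]), _mask_digits(lines[-1])


def guess_header_footer(pages, min_fraction=0.6):
    """Guess a repeated header and footer from a list of per-page strings.

    A line counts as a header (or footer) only if it is the first (or last)
    non-empty line on at least min_fraction of the pages. This is a heuristic
    and an assumption of this sketch, not a feature of any PDF library.
    """
    firsts, lasts = Counter(), Counter()
    for page in pages:
        first, last = edge_lines(page)
        if first is not None:
            firsts[first] += 1
            lasts[last] += 1
    threshold = min_fraction * len(pages)
    header = next((ln for ln, n in firsts.most_common(1) if n >= threshold), None)
    footer = next((ln for ln, n in lasts.most_common(1) if n >= threshold), None)
    return header, footer


# Hypothetical sample pages standing in for real extracted text.
sample_pages = [
    "ACME Report\nbody one\nPage 1 of 3",
    "ACME Report\nbody two\nPage 2 of 3",
    "ACME Report\nbody three\nPage 3 of 3",
]
print(guess_header_footer(sample_pages))
```

The heuristic will miss headers that span several lines or that change from page to page, so the result is best treated as a candidate to check by hand.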

jothi prabu
- 21
- 1
2
votes
1 answer
PDF scraping: get company and subsidiaries tables
I am trying to scrape this PDF containing information about company subsidiaries. I have seen many posts using the R package Tabulizer but this, unfortunately, doesn't work on my Mac for some reason. As Tabulizer uses Java dependencies, I tried…

Amleto
- 584
- 1
- 7
- 25