Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.
Questions tagged [pdf-extraction]
148 questions
0
votes
0 answers
Generating ToUnicode CMaps (Programmatically or Visually)
I have several problematic PDFs, which I am attempting to convert to PDF/A-1a.
These documents utilize CID Identity-H embedded subsets, generated with Acrobat Distiller 20.0. I have performed searches for tools which could utilize OCR to scan the…

Kadaj Nakamura
0
votes
1 answer
Finding text coordinates using bytescout PDFExtractor C#
I have a PDF in which I need to find and replace some text. I know how to create overlays and add text, but I can't determine how to locate the coordinates of the existing text. This is the example I found on the bytescout site:
// Create…

KarkMump
0
votes
0 answers
Error when changing the current working directory
When I run the code below, the line os.chdir(folder_path) raises an error. What's wrong?
This is my folder hierarchy:
-data
  -NotaCorretagem_60076_20181009.pdf
-output
-report
-script
  -data_extraction.py
My data_extraction.py file code:
# import…
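
One possible way to avoid a failing os.chdir call is to build the target path from the script's own location instead of the current working directory. The sketch below assumes that data and script are sibling folders and that data_extraction.py lives in script; the helper names resolve_data_dir and chdir_to_data are hypothetical and not taken from the question.

```python
import os
from pathlib import Path


def resolve_data_dir(script_file: str) -> Path:
    """Return the 'data' folder next to the folder that holds the script.

    Assumes a layout like <project>/script/<script_file> and <project>/data.
    """
    project_root = Path(script_file).resolve().parent.parent
    data_dir = project_root / "data"
    if not data_dir.is_dir():
        # Fail with a readable message instead of a bare os.chdir error.
        raise FileNotFoundError(f"Expected data folder at {data_dir}")
    return data_dir


def chdir_to_data(script_file: str) -> Path:
    """Change the working directory to the data folder and return it."""
    data_dir = resolve_data_dir(script_file)
    os.chdir(data_dir)
    return data_dir


# Usage inside script/data_extraction.py (assumed layout):
# chdir_to_data(__file__)
```

Resolving from __file__ keeps the script independent of the directory it is launched from, which is a common cause of this kind of error.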
0
votes
0 answers
Data Extraction using tika library
The requirement is to parse PDF and document files.
How can I parse only the required pages? For example, a doc/PDF file has 10 pages, but the requirement is to parse only pages 1-3 and the last page.
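
The page selection itself can be sketched independently of the parser. The example below assumes the per-page texts are already available as a list (how to obtain them from Tika is not shown here); pages_to_keep and select_pages are hypothetical helper names.

```python
from typing import List


def pages_to_keep(total_pages: int, leading: int = 3) -> List[int]:
    """Return 1-based page numbers: the first `leading` pages plus the last page."""
    if total_pages < 1:
        return []
    keep = list(range(1, min(leading, total_pages) + 1))
    # Add the last page only if it is not already among the leading pages.
    if total_pages > leading:
        keep.append(total_pages)
    return keep


def select_pages(page_texts: List[str], leading: int = 3) -> List[str]:
    """Pick the wanted pages from a list holding one text per page."""
    return [page_texts[n - 1] for n in pages_to_keep(len(page_texts), leading)]


# For a 10-page document this keeps pages 1, 2, 3 and 10.
print(pages_to_keep(10))
```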

Santosh Singh
0
votes
0 answers
Extract data from pdf boxes in R
The PDF has boxes containing data. I want to extract all the data from these boxes in R, without using OCR.
I have tried the Tabulizer package, but it returns unorganized results, which makes extraction impossible.
report <-…

Dinesh Mandal
0
votes
1 answer
Passing a pdf file to a function when it requires a path or link
I am working on a web application for an online library. I want to extract metadata from the PDFs that will be uploaded, and for that I am using the Node.js library pdf.js-extract, with multer-gridfs-storage for the upload. The problem is that I am…

Luis de la Cal
0
votes
0 answers
Using PDFMiner.six with Python 3 prints weird characters to file
I am currently working with PDFMiner.six to extract text from multiple PDFs. Looking at my output, I can see some weird conversions of special characters such as brackets:
Opening and closing brackets:
Finally, I delete all paragraphs 共defined…

Florian Schramm
0
votes
2 answers
How can I print the tables in a .pdf file using Python?
CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\Users\vijv2c13136\AppData\Local\Continuum\anaconda2\lib\site-packages\tabula\tabula-1.0.2-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON',…

A.Viji
0
votes
1 answer
How to use page.filter(test_function) in PDFPlumber library?
I am trying to remove the tables from a PDF page using the page.filter() function. I have the table bbox coordinates and I am trying to check whether each object's coordinates fall inside the table coordinates.
But I was unable…
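
A minimal sketch of such a test function, assuming each page object is a dict with x0, x1, top and bottom keys and each table bbox is an (x0, top, x1, bottom) tuple. The helper names inside_bbox and make_outside_tables_test are hypothetical, and the midpoint rule is just one reasonable way to decide whether an object belongs to a table.

```python
def inside_bbox(obj, bbox):
    """True if the object's midpoint lies inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def make_outside_tables_test(table_bboxes):
    """Build a test function that keeps only objects outside every table bbox."""
    def keep(obj):
        return not any(inside_bbox(obj, bbox) for bbox in table_bboxes)
    return keep


# Usage with pdfplumber (assumes `page` is an open pdfplumber page):
# table_bboxes = [table.bbox for table in page.find_tables()]
# text = page.filter(make_outside_tables_test(table_bboxes)).extract_text()
```

page.filter() keeps the objects for which the test function returns True, so the test has to answer "is this object outside all tables?" rather than "is it inside one?".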

Satyaaditya
0
votes
2 answers
Tabula CalledProcessError: returned non-zero exit status 2. Tried everything possible
I keep getting this error while using Tabula in Python.
I've gone through every Stack Overflow question related to this, and blogs as well.
My JDK and JRE are up to date.
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java…

Pai
0
votes
1 answer
Confluence: Is there a way to use space variables in Global PDF Stylesheet? Or somehow include it on PDF Exports
For PDF exports, I'm trying to place the space name at the bottom center of the exported pages.
I tried the following, but have had no luck so far:
@bottom-center
{
content: $space.getName();
}
I think the space variables do not work within the PDF CSS…

Adeel
0
votes
1 answer
iTextSharp extract each character and getRectangle
I would like to parse an entire PDF character by character and be able to get the ASCII value, font and the Rectangle of that character on that PDF document which I can later use to save as a bitmap. I tried using PdfTextExtractor.GetTextFromPage…

amyn
0
votes
1 answer
iTextSharp anagram output when extracting text from a rectangle
I'm trying to extract text from a rectangle with iTextSharp, and it works fine with almost all the sections inside the document, except for some specific areas. These areas are simple bold caps titles and simple content with a slightly smaller font…

Mattia Biggi
0
votes
0 answers
iTextSharp returning ????? when extracting text from PDF
I'm using iTextSharp with the following command to extract text from PDFs, and it was working well. However, today I received a different PDF, and extracting it produced a lot of ? ? ? ?.
Does anybody know why that's happening? Is there any way to at…

Felipe Santiago
0
votes
1 answer
How to extract text from PDF using PDFTextStream in Java
No text is extracted from the Sample.pdf file when using pdftextstream-2.6.3.jar:
String filePath = "D:\\inbox\\temp\\Sample.pdf";
File document = new File(filePath);
StringBuffer pdfText = new StringBuffer(1024);
com.snowtide.pdf.OutputTarget tgt = new…

UdayKiran Pulipati