Highest Voted 'pdf-extraction' Questions

0

votes

0 answers

Generating ToUnicode CMaps (Programmatically or Visually)

I have several problematic PDFs, which I am attempting to convert to PDF/A-1a. These documents utilize CID Identity-H embedded subsets, generated with Acrobat Distiller 20.0. I have performed searches for tools which could utilize OCR to scan the…

asked Apr 15 '20 at 04:14

Kadaj Nakamura

923
1
10
24

0

votes

1 answer

Finding text coordinates using bytescout PDFExtractor C#

I have a PDF in which I need to find and replace some text. I know how to create overlays and add text, but I can't determine how to locate the coordinates of the existing text. This is the example I found on the bytescout site - // Create…

c# pdf coordinates pdf-extraction

asked Apr 03 '20 at 16:34

KarkMump

81
1
8

0

votes

0 answers

Error when changing the current working directory

When I run the code below, the line os.chdir(folder_path) returns an error. What's wrong? This is my folder hierarchy: -data -NotaCorretagem_60076_20181009.pdf -output -report -script -data_extraction.py My data_extraction.py file code: # import…

python chdir pdf-extraction

asked Oct 29 '19 at 10:34

Elsior Moreira Alves Junior

43
7
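A note on the os.chdir question above: the excerpt is truncated, so the actual error is not known. One frequent cause of this kind of failure is a relative path being resolved against the directory the interpreter was started from, not the script's own location. The sketch below assumes the quoted hierarchy means data, output, report and script are sibling folders, with data_extraction.py inside script; the helper name `project_dir` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def project_dir(script_file, name):
    """Return the absolute path of a sibling folder of the script's folder.

    Assumes data/, output/, report/ and script/ share one parent directory.
    """
    script_folder = Path(script_file).resolve().parent
    return script_folder.parent / name


# Demonstration on a throwaway copy of the layout, so the example runs anywhere.
root = Path(tempfile.mkdtemp()).resolve()
for folder in ("data", "output", "report", "script"):
    (root / folder).mkdir()
script = root / "script" / "data_extraction.py"
script.touch()

original_cwd = Path.cwd()
data_path = project_dir(script, "data")
os.chdir(data_path)      # does not depend on the starting directory
current = Path.cwd()
os.chdir(original_cwd)   # restore the working directory
```

Inside the real script, the call would be `project_dir(__file__, "data")`.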

0

votes

0 answers

Data Extraction using tika library

The requirement is to parse PDF and document files. How can I parse only the required pages? For example, a doc/PDF file has 10 pages, but the requirement is to parse only pages 1-3 and the last page.

java apache-tika pdf-extraction

asked Sep 16 '19 at 11:56

Santosh Singh

11
2

0

votes

0 answers

Extract data from pdf boxes in R

The PDF has boxes with data. I want to extract all the data from these boxes in R, without using OCR. I have tried the Tabulizer package, but it gives unorganized results that make extraction impossible. report <-…

r pdf-extraction tabulizer pdftables

asked Jul 25 '19 at 10:48

Dinesh Mandal

23
3

0

votes

1 answer

Passing a pdf file to a function when it requires a path or link

I am working on a web application for an online library. I want to extract metadata from the PDFs that will be uploaded, and for that I am using the Node.js library pdf.js-extract and multer-gridfs-storage for the upload. The problem is that I am…

javascript node.js pdf.js pdf-extraction multer-gridfs-storage

asked May 07 '19 at 09:03

Luis de la Cal

41
8

0

votes

0 answers

Using PDFMiner.six Python3 prints weird characters to file

I am currently working with PDFMiner.six to extract text from multiple PDFs. Looking at my output, I can see that I get some weird conversions of special characters like brackets: Opening and closing brackets: Finally, I delete all paragraphs 共deﬁned…

python unicode utf-8 character-encoding pdf-extraction

asked May 06 '19 at 09:38

Florian Schramm

333
3
15

0

votes

2 answers

How can I print the tables in a .pdf file using python

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\Users\vijv2c13136\AppData\Local\Continuum\anaconda2\lib\site-packages\tabula\tabula-1.0.2-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON',…

python pdf-extraction

asked Dec 13 '18 at 06:18

A.Viji

23
2
7

0

votes

1 answer

How to use page.filter(test_function) in PDFPlumber library?

I am trying to delete tables inside a PDF page, and I'm trying to use the page.filter() function for that. I have the table bbox coordinates, and I am trying to check whether object coordinates are inside the table coordinates or not. But I was unable…

python pdf pdf-parsing pdf-extraction

asked Nov 03 '18 at 08:30

Satyaaditya

537
8
26
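A note on the page.filter question above: the function passed to `page.filter` receives one object at a time and returns a truthy value to keep it. The sketch below is one possible predicate, not the accepted answer. It assumes bounding boxes are `(x0, top, x1, bottom)` tuples and that each object is a dict carrying those four keys; the helper names `outside_bbox` and `make_table_filter` are hypothetical, and plain dicts stand in for real pdfplumber objects.

```python
def outside_bbox(obj, bbox):
    """True when obj does not lie entirely inside bbox.

    bbox is (x0, top, x1, bottom); obj is a dict with the same four keys.
    """
    x0, top, x1, bottom = bbox
    inside = (
        obj["x0"] >= x0
        and obj["x1"] <= x1
        and obj["top"] >= top
        and obj["bottom"] <= bottom
    )
    return not inside


def make_table_filter(table_bboxes):
    """Build a one-argument predicate suitable for page.filter(...)."""
    def keep(obj):
        # Keep the object only if it is outside every table region.
        return all(outside_bbox(obj, bbox) for bbox in table_bboxes)
    return keep


# Plain dicts stand in for pdfplumber objects here.
table_bboxes = [(50, 100, 300, 200)]
keep = make_table_filter(table_bboxes)
char_in_table = {"x0": 60, "x1": 66, "top": 110, "bottom": 120}
char_in_body = {"x0": 60, "x1": 66, "top": 300, "bottom": 310}
kept = [keep(char_in_table), keep(char_in_body)]  # [False, True]
```

With pdfplumber installed, the predicate would be applied as `page.filter(keep)`, and text extracted from the filtered page should then skip the table regions. Objects that only partially overlap a table are kept by this version; whether that is the desired behavior depends on the document.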

0

votes

2 answers

Tabula CalledProcessError: returned non-zero exit status 2. Tried everything possible

I keep getting this error while using Tabula in Python. I've gone through EVERY Stack Overflow question related to this, and blogs as well. My JDK/JRE is up to date. java version "1.8.0_161" Java(TM) SE Runtime Environment (build 1.8.0_161-b12) Java…

python tabula pdf-extraction

asked Oct 04 '18 at 05:22

Pai

1
5

0

votes

1 answer

Confluence: Is there a way to use space variables in Global PDF Stylesheet? Or somehow include it on PDF Exports

For PDF exports, I'm trying to export the space name to the bottom center of the export. I tried the following but no luck so far: @bottom-center { content: $space.getName(); } I think the space variables do not work within the PDF CSS…

css pdf confluence pdf-extraction

asked Mar 17 '18 at 19:15

Adeel

41
9

0

votes

1 answer

iTextSharp extract each character and getRectangle

I would like to parse an entire PDF character by character and be able to get the ASCII value, font and the Rectangle of that character on that PDF document which I can later use to save as a bitmap. I tried using PdfTextExtractor.GetTextFromPage…

itext pdf-extraction

asked Jan 21 '16 at 07:17

amyn

922
11
24

0

votes

1 answer

iTextSharp anagram output when extracting text from a rectangle

I'm trying to extract text from a rectangle with iTextSharp, and it works fine with almost all the sections inside the document, except for some specific areas. These areas are simple bold caps titles and simple content with a slightly smaller font…

vb.net pdf itext pdf-extraction

asked Jan 12 '16 at 10:41

Mattia Biggi

3
2

0

votes

0 answers

iTextSharp returning ????? when extracting Text from PDF

I'm using iTextSharp with the following command to extract text from a PDF, and it was working well. However, today I received a different PDF, and extracting it produced a lot of ? ? ? ?. Does anybody know why that's happening? Is there any way to at…

c# pdf itext pdf-extraction

asked Aug 26 '15 at 20:59

Felipe Santiago

414
6
16

0

votes

1 answer

How to extract text from PDF using PDFTextStream using Java

Text is not extracted from Sample.pdf file by using pdftextstream-2.6.3.jar String filePath = "D:\\inbox\\temp\\Sample.pdf"; File document = new File(filePath); StringBuffer pdfText = new StringBuffer(1024); com.snowtide.pdf.OutputTarget tgt = new…

java pdf pdf-extraction pdftextstream snowtide

asked Jan 07 '15 at 11:22

UdayKiran Pulipati

6,579
7
67
92
