Questions tagged [pdf-extraction]

Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.

148 questions
2
votes
0 answers

How to extract pdf pages which are more than 2000 characters per page using tika parser in python?

I want to extract the PDF pages that contain more than 2000 characters, using the Tika parser in Python. With the code below I extracted the [metadata], and from it I used pdf:charsPerPage to apply a minimum character limit per page (2000).…
2
votes
2 answers

Extract text in a rectangle from pdf - Python

I need to extract text that lies inside a rectangle in a PDF. I have tested several methods, but none of them returns the specific text. For example, I tested the PyMuPDF, pdfplumber, tabula, camelot, and pdftables packages. In PyMuPDF module…
Kamaal Shaik
2
votes
0 answers

how to extract outline of pdfs from pdf bundle file and write it to csv file using bash script or node.js

I have a PDF bundle, and I need to extract the outline name of each PDF and write it to a CSV file using a bash script or Node.js. I am using pdftk in the bash script, where I have used this command: pdftk input.pdf burst output…
Sherin Green
2
votes
1 answer

I want to upload a file locally then upload that file to S3. However Multer only allows one or the other at a time

My goal is the following: I want to take a user-uploaded PDF, extract the text from that PDF, and assign the text to an array object. Once this is done, I want to upload that file to an S3 bucket. Right now I am able to do the first part without…
Dan Bee
2
votes
2 answers

Python-Camelot extracting empty tables

I am using Camelot to extract multiple sections of a PDF by the following command. cgl_section = camelot.read_pdf(filename, flavor='stream', table_areas=['35,490,155,483', '53,480,110,470', '117,480,155,470', …
A.A. F
2
votes
0 answers

How to convert a PDF image or an image to text using Tesseract and/or Poppler?

Python 3.6.1, Mac OS X. Regarding Tesseract, I have tried many different sample/template code snippets that I found online for PDF -> Text and Image -> Text. None of them seems to work. Please let me know if you know of code that works or a website with a…
gmonz
2
votes
0 answers

Extract text from a PDF email attachment without saving the attachment to a pdf file first

I'm using PDF Extractor (from here) to get the text from PDF attachments in emails. It seems to me that the only way I can extract the text is to save the PDF to a file and then use the code. Private Function ReadPdfToStringList(tempfilename As…
David Wilson
2
votes
1 answer

How to extract a paragraph from a pdf file and store its position?

I'm going to extract the content of a PDF file using the PDFBox library. The content should be processed paragraph by paragraph, and for each paragraph I need its position for follow-up processing. Using the following code, I can extract the whole…
AmirHJ
2
votes
0 answers

RUBY pdf-extract gem to extract references from scholarly article does not work?

I am a newbie to both Ruby and its pdf-extract gem. After installing 64-bit Ruby and the related Development Kit, I installed pdf-extract with the command below: gem install pdf-extract By checking the quick examples from the web site…
mlee_jordan
1
vote
1 answer

Azure Form Intelligence Connected Container Setup

We have a requirement for PDF parsing and are planning to use Azure Form Intelligence. Since our client has sensitive information, we don't want to send our data to Azure; instead we will be using Form Intelligence connected containers and will be…
1
vote
3 answers

Extract author names in the PDF using Python

I have multiple PDF files where I need to extract the author names. I need to extract the author names only from the first page of the PDF file and ignore all the other pages. I have multiple PDF files with the same format where I need to extract…
merkle
1
vote
1 answer

Filter text in PDF by font with Borb using regex

I am trying to extract text from a PDF using Borb, and I can see there is a clear example of extracting text by font name: # create FontNameFilter l0: FontNameFilter = FontNameFilter("Helvetica") # filtered text just gets passed to…
IamButtman
1
vote
1 answer

How to use Camelot-py to split rows when text exist on a specific column

I am trying to extract table information from a PDF using the Camelot-py library. Initially I used the stream flavor like this: import camelot tables = camelot.read_pdf('sample.pdf', flavor='stream', pages='1', columns=['110,400'], split_text=True,…
1
vote
0 answers

Camelot pdf extraction has an issue while copying texts among span cells

I am extracting data from PDFs using camelot and am faced with the following issue on page 3 of this datasheet. The problematic table is shown below: The issue is an inconsistency when the content of spanning cells is copied. As you can see on the…
Said Akyuz
1
vote
1 answer

Python extract text between two tables as title for the table(outside tables) from pdf with tabula

I am trying to extract tables from a PDF file. After trying multiple different packages, I found that tabula is the best one for extracting the tables from my PDF file correctly. The thing is that, for each table, there is a title for it above the table (not…
user15410844