Questions tagged [pdf-extraction]

Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.

148 questions
0
votes
3 answers

Hyperlink Detection from PDF

I have some PDFs containing hyperlinks, both URLs and mailto links. Is there any way or tool (maybe third-party) to extract the hyperlink metadata from the PDF, such as coordinates, link type, and destination address? Any help is highly…
Tech Enthusiast
  • 279
  • 1
  • 5
  • 18
0
votes
2 answers

python - pull pdfs from webpage and convert to html

My goal is to have a python script that will access particular webpages, extract all pdf files on each page that have a certain word in their filename, convert them into html/xml, then go through the html files to read data from the pdfs' tables. So…
maniciam
  • 365
  • 5
  • 10
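For the first step described in this question (finding the PDFs on a page whose filenames contain a certain word), a minimal sketch using only the Python standard library might look like the following. The names `PdfLinkCollector` and `find_pdf_links`, the sample HTML, and the `example.com` base URL are all hypothetical and not taken from the question. Downloading the files and converting them to HTML/XML are not covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> links whose filename is a PDF
    containing a given keyword (case-insensitive)."""

    def __init__(self, base_url, keyword):
        super().__init__()
        self.base_url = base_url
        self.keyword = keyword.lower()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Look only at the last path segment, ignoring query strings.
        filename = posixpath.basename(urlparse(absolute).path).lower()
        if filename.endswith(".pdf") and self.keyword in filename:
            self.links.append(absolute)


def find_pdf_links(html, base_url, keyword):
    """Return PDF links in `html` whose filename contains `keyword`."""
    collector = PdfLinkCollector(base_url, keyword)
    collector.feed(html)
    return collector.links


# Hypothetical page content, used only for illustration.
sample_html = """
<a href="/docs/report_2013.pdf">Report</a>
<a href="/docs/summary.pdf">Summary</a>
<a href="notes/Annual-Report.PDF">Annual</a>
<a href="/about.html">About</a>
"""
pdf_links = find_pdf_links(sample_html, "http://example.com/files/", "report")
```

With the sample above, `pdf_links` holds the two links whose filenames contain "report"; the page HTML itself would normally come from an HTTP request rather than a string literal.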
0
votes
1 answer

How to make an existing PDF editable in an Android app?

I'm making an Android app. I'm able to write text, create new PDFs, and read existing PDFs, but I can't find a way to edit existing PDFs. Editing PDFs in my app is the goal I have to achieve. I tried to convert…
Yushi
  • 416
  • 6
  • 24
0
votes
1 answer

Perl error - Can't call method "getPageContent" on an undefined value?

Hi, I'm trying to extract the content of a PDF file but I'm facing the above problem. My code is: use PDF; use CAM::PDF; use CAM::PDF::PageText; my $file = "s.pdf"; my $pdf = CAM::PDF->new($file); my $pageone_tree = $pdf->getPageContent(1); print…
backtrack
  • 7,996
  • 5
  • 52
  • 99
0
votes
0 answers

PDFBox does not show the check box as checked even when I check it in code

I am using the PDFBox Java API to fill in values in a PDF. I can fill the text box values. When I use the check() method for checkboxes, as shown in "How to check a check box in PDF-form using Java PDFBOX api", it sets the value to true in the background but that does…
user583726
  • 647
  • 2
  • 9
  • 20
0
votes
1 answer

Export a PDF file from PowerPoint with VBA

I want to be able to export the PDF files that I insert in my PowerPoint presentation using VBA. I know that you can add a .zip extension to the pptx file (just modifying the name of the file) and then check the content of the presentation. It works…
Iban Arriola
  • 2,526
  • 9
  • 41
  • 88
0
votes
1 answer

iOS getting text from pdf

Hello, I'm working on a speed-reading app and I'm looking for some tips or suggestions. In this app I have to use different reading techniques, which requires formatting the text from a PDF in different sizes. For techniques such as auto scrolling without…
ddnl
  • 495
  • 1
  • 6
  • 22
0
votes
2 answers

Best way to get a database friendly list of Veteran Affairs Hospital

I sincerely apologize if this isn't the proper forum to discuss this, but I wasn't sure where to go or what would be the best option. Basically, I'm trying to find a database friendly list of veteran affairs hospitals. The closest thing that I've…
AJ Tatum
  • 653
  • 2
  • 15
  • 35
0
votes
1 answer

Correctly extracting the text from a PDF (UTF-8)

I want to extract text from some pdf files (programmatically, with some utility or even with copy/paste) but some characters are coming out really strange. Although I specify UTF-8 encoding when extracting the text, characters like "ș, ț, ă," etc…
Andrei F
  • 4,205
  • 9
  • 35
  • 66
-1
votes
1 answer

How to extract a table without all borders into text with Python?

I am trying to extract a table like this into a DataFrame. How can I do that (and even extract the names split across several lines) with Python? Also, I want this to be general and applicable to any table (even if it doesn't have this structure), so…
-1
votes
1 answer

How to build and train a model that reads data extracted from a PDF

Here I share my code, main.py: from fitz import fitz import spacy location = "D:\python\Resume-Sample.pdf" text = '' with fitz.open(str(location)) as doc: for page in doc: text+=page.get_text("block") NER =…
-1
votes
1 answer

How to distinguish uploaded PDFs to extract data through regular expressions in Python Django

PDFs are uploaded and converted into text. After converting them into text, I use a regular expression to get some specific data from the PDFs. There are various kinds of PDFs, and I have to use different types of regular expression for…
zenvar
  • 19
  • 8
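One common way to approach the problem in this question is to classify the extracted text first and then apply only the regular expressions registered for that document type. The sketch below is an assumption-laden illustration, not the asker's code: the `PROFILES` table, the document types ("invoice", "statement"), the marker phrases, and the field patterns are all hypothetical.

```python
import re

# Hypothetical profiles: each document type has a marker pattern used to
# recognise it and a set of field patterns used to extract data from it.
PROFILES = {
    "invoice": {
        "marker": re.compile(r"\bINVOICE\b", re.IGNORECASE),
        "fields": {
            "number": re.compile(r"Invoice\s+No\.?\s*:\s*(\S+)", re.IGNORECASE),
            "total": re.compile(r"Total\s*:\s*([\d.,]+)", re.IGNORECASE),
        },
    },
    "statement": {
        "marker": re.compile(r"\bACCOUNT STATEMENT\b", re.IGNORECASE),
        "fields": {
            "account": re.compile(r"Account\s*:\s*(\S+)", re.IGNORECASE),
        },
    },
}


def classify(text):
    """Return the first profile name whose marker appears in the text."""
    for name, profile in PROFILES.items():
        if profile["marker"].search(text):
            return name
    return None


def extract_fields(text):
    """Classify the text, then run only that profile's field patterns."""
    kind = classify(text)
    if kind is None:
        return None, {}
    result = {}
    for field, pattern in PROFILES[kind]["fields"].items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return kind, result
```

Keeping the patterns in a table means a new kind of PDF only needs a new entry rather than another branch of conditionals. In a Django view, `extract_fields` would be called on the text produced by whatever PDF-to-text step is already in place.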
-2
votes
1 answer

Regular Expressions for Different PDFs

I'm trying to parse some PDFs, extract the tabular data, and output it to JSON files. I'm using regex to search for column values under "Account" and "Allocations". What regex should I use instead? It needs to be general enough to work for all…
Rabiya
  • 23
  • 5