AWS Textract to create searchable PDF - looking for python code

Question

I would like to extract handwritten text from a scanned image using, say, Amazon Textract, and then create a searchable PDF from the output, that is, convert the image into a PDF with a text layer.

Amazon has provided a blog post and Java code showing how it can be done.

Blog post - Link
Java Code - Link

I would like to be able to do it in Python. Python code examples showing AWS Textract usage are all here - link.

However, these examples do not show how to take the response from AWS Textract and create a searchable PDF from it. Has anybody written code for that last step, creating a searchable PDF from the Textract response?

Thank you.

Creating a PDF from text that you extracted from an image is not something that AWS Textract or other AWS services can do for you. Use the typical Python libraries to do this, for example [PyPDF2.PdfFileWriter](https://pythonhosted.org/PyPDF2/PdfFileWriter.html). — jarmod, Feb 17 '21 at 03:36
@jarmod ok - got it. Will work on figuring out how to use PdfFileWriter when I next go down this path. Thank you! — jim70, Mar 14 '21 at 01:01

score 1 · Answer 1 · answered Apr 19 '23 at 12:38

Here is an aws-samples repo that uses Python to create searchable PDFs, similar to the Java code you linked.
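For the last step the question asks about, here is a minimal sketch (not the repo's code) of how a Textract response might be turned into a text layer. It assumes the response has the usual `Blocks` list in which `LINE` blocks carry `Text` and a `Geometry.BoundingBox` with coordinates normalized to 0..1, and it assumes the third-party `reportlab` package for writing the PDF. The function names `line_boxes` and `write_searchable_pdf`, and the `page_width`/`page_height` parameters (in PDF points), are made up for illustration.

```python
def line_boxes(response, page_width, page_height):
    """Yield (text, x, y, height) in PDF points for each LINE block."""
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        x = box["Left"] * page_width
        height = box["Height"] * page_height
        # Textract measures Top from the top edge of the page,
        # while PDF coordinates start at the bottom-left corner.
        y = page_height - (box["Top"] + box["Height"]) * page_height
        yield block["Text"], x, y, height


def write_searchable_pdf(image_path, response, out_path,
                         page_width, page_height):
    """Draw the scan as the page background and overlay invisible text."""
    # Assumption: reportlab is installed (pip install reportlab).
    from reportlab.pdfgen import canvas

    pdf = canvas.Canvas(out_path, pagesize=(page_width, page_height))
    pdf.drawImage(image_path, 0, 0, width=page_width, height=page_height)
    for text, x, y, height in line_boxes(response, page_width, page_height):
        layer = pdf.beginText(x, y)
        layer.setTextRenderMode(3)  # render mode 3: text is invisible
        # Rough guess: use the box height as the font size.
        layer.setFont("Helvetica", max(height, 1))
        layer.textOut(text)
        pdf.drawText(layer)
    pdf.showPage()
    pdf.save()
```

The `response` argument would typically come from something like `boto3.client("textract").detect_document_text(Document={"Bytes": image_bytes})`. This sketch handles a single page and does not stretch the text to match each box's width, so selection highlights may not line up exactly with the handwriting; the built-in Helvetica font also only covers Latin characters.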

tbrk


Thank you for sharing. I will try to come back to this when I take it on again. For now, it is on the back burner :) – jim70 Apr 20 '23 at 13:53
