
I have been doing research for a very important personal project. I would like to create a Flask search application that lets me search for content across 100+ PDF files. I have found some information about an Elasticsearch library that works well with Flask.

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

# import libraries to help read and create PDF
import PyPDF2
from fpdf import FPDF
import base64
import json
from flask import Flask, jsonify, request, render_template
from datetime import datetime
import pandas as pd

# import the Elasticsearch low-level client library
from elasticsearch import Elasticsearch
# create a new client instance of Elasticsearch
elastic_client = Elasticsearch(hosts=["localhost"])
es = Elasticsearch("http://localhost:9200/")
app = Flask(__name__)

# create a new PDF object with FPDF
pdf = FPDF()

# use an iterator to create 10 pages
for page in range(10):
    pdf.add_page()
    pdf.set_font("Arial", size=14)
    pdf.cell(150, 12, txt="Object Rocket ROCKS!!", ln=1, align="C")

# output all of the data to a new PDF file
pdf.output("object_rocket.pdf")

'''
read_pdf = PyPDF2.PdfFileReader("object_rocket.pdf")
page = read_pdf.getPage(0)
page_mode = read_pdf.getPageMode()
page_text = page.extractText()
print (type(page_text))
'''
#with open(path, 'rb') as file:

# get the PDF path and read the file
file = "Sheet3.pdf"
read_pdf = PyPDF2.PdfFileReader(file, strict=False)
#print (read_pdf)

# get the read object's meta info
pdf_meta = read_pdf.getDocumentInfo()

# get the page numbers
num = read_pdf.getNumPages()
print ("PDF pages:", num)

# create a dictionary object for page data
all_pages = {}

# put meta data into a dict key
all_pages["meta"] = {}

# Use `iteritems()` instead of `items()` for Python 2
for meta, value in pdf_meta.items():
    print (meta, value)
    all_pages["meta"][meta] = value

# iterate the page numbers
for page in range(num):
    data = read_pdf.getPage(page)
    #page_mode = read_pdf.getPageMode()

    # extract the page's text
    page_text = data.extractText()

    # put the text data into the dict
    all_pages[page] = page_text

# create a JSON string from the dictionary
json_data = json.dumps(all_pages)
#print ("\nJSON:", json_data)

# convert JSON string to bytes-like obj
bytes_string = bytes(json_data, 'utf-8')
#print ("\nbytes_string:", bytes_string)

# convert bytes to base64 encoded string
encoded_pdf = base64.b64encode(bytes_string)
encoded_pdf = str(encoded_pdf)
#print ("\nbase64:", encoded_pdf)

# put the PDF data into a dictionary body to pass to the API request
body_doc = {"data": encoded_pdf}

# call the index() method to index the data
result = elastic_client.index(index="pdf", doc_type="_doc", id="42", body=body_doc)

# print the returned results
#print ("\nindex result:", result['result'])

# make another Elasticsearch API request to get the indexed PDF
result = elastic_client.get(index="pdf", doc_type='_doc', id=42)

# print the data to terminal
result_data = result["_source"]["data"]
#print ("\nresult_data:", result_data, '-- type:', type(result_data))

# decode the base64 data (slice off the leading "b'" and trailing "'"
# that str() left around the bytes literal)
decoded_pdf = base64.b64decode(result_data[2:-1]).decode("utf-8")
#print ("\ndecoded_pdf:", decoded_pdf)

# take decoded string and make into JSON object
json_dict = json.loads(decoded_pdf)
#print ("\njson_str:", json_dict, "\n\ntype:", type(json_dict))
result2 = elastic_client.index(index="pdftext", doc_type="_doc", id="42", body=json_dict)

# create new FPDF object
pdf = FPDF()

# build the new PDF from the Elasticsearch dictionary
# Use `iteritems()` instead of `items()` for Python 2
""" for page, value in json_data:
    if page != "meta":
        # create new page
        pdf.add_page()
        pdf.set_font("Arial", size=14)

        # add content to page
        output = value + " -- Page: " + str(int(page)+1)
        pdf.cell(150, 12, txt=output, ln=1, align="C")
    else:
        # create the meta data for the new PDF
        for meta, meta_val in json_dict["meta"].items():
            if "title" in meta.lower():
                pdf.set_title(meta_val)
            elif "producer" in meta.lower() or "creator" in meta.lower():
                pdf.set_creator(meta_val)
 """
# output the PDF object's data to a PDF file
#pdf.output("object_rocket_from_elaticsearch.pdf" )

@app.route('/', methods=['GET'])
def index():

    return jsonify(json_dict)

@app.route('/<id>', methods=['GET'])
def index_by_id(id):

    return jsonify(json_dict[id])


""" @app.route('/insert_data', methods=['PUT'])
def insert_data():
    slug = request.form['slug']
    title = request.form['title']
    content = request.form['content']

    body = {
        'slug': slug,
        'title': title,
        'content': content,
        'timestamp': datetime.now()
    }

    result = es.index(index='contents', doc_type='title', id=slug, body=body)

    return jsonify(result) """



if __name__ == '__main__':
    app.run(port=5003, debug=True)

------Progress------ I now have a working solution with no front-end search capability:

# Load_single_PDF_BY_PAGE_TO_index.py
#!/usr/bin/env python3
#-*- coding: utf-8 -*-

# import libraries to help read and create PDF
import PyPDF2
from fpdf import FPDF
import base64

from flask import Flask, jsonify, request, render_template, json
from datetime import datetime
import pandas as pd

# import the Elasticsearch low-level client library
from elasticsearch import Elasticsearch
# create a new client instance of Elasticsearch
elastic_client = Elasticsearch(hosts=["localhost"])
es = Elasticsearch("http://localhost:9200/")
app = Flask(__name__)


#with open(path, 'rb') as file:

# get the PDF path and read the file
file = "Sheet3.pdf"
read_pdf = PyPDF2.PdfFileReader(file, strict=False)
#print (read_pdf)

# get the read object's meta info
pdf_meta = read_pdf.getDocumentInfo()

# get the page numbers
num = read_pdf.getNumPages()
print ("PDF pages:", num)

# create a dictionary object for page data
all_pages = {}

# put meta data into a dict key
all_pages["meta"] = {}

# Use `iteritems()` instead of `items()` for Python 2
for meta, value in pdf_meta.items():
    print (meta, value)
    all_pages["meta"][meta] = value

# starting Elasticsearch document id for the page documents
x = 44
# iterate the page numbers
for page in range(num):
    data = read_pdf.getPage(page)
    #page_mode = read_pdf.getPageMode()

    # extract the page's text
    page_text = data.extractText()

    # put the text data into the dict
    all_pages[page] = page_text

    body_doc2 = {"data": page_text}
    result3 = elastic_client.index(index="pdfclearn", doc_type="_doc", id=x, body=body_doc2)
    x += 1

The above code loads a single PDF into Elasticsearch by page.
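To scale this to the 100+ PDF target, one option (a minimal sketch, not taken from the code above; the pdfs/ folder and pdf_pages index name are my own placeholders) is to loop over a folder and index every page as its own document, keeping the source filename and page number so each hit can be traced back to a file and page:

import os

import PyPDF2
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200/")

PDF_DIR = "pdfs"       # assumed folder holding the 100+ PDF files
INDEX = "pdf_pages"    # assumed index name: one document per page

for filename in os.listdir(PDF_DIR):
    if not filename.lower().endswith(".pdf"):
        continue
    reader = PyPDF2.PdfFileReader(os.path.join(PDF_DIR, filename), strict=False)
    for page_num in range(reader.getNumPages()):
        page_text = reader.getPage(page_num).extractText()
        # a deterministic id means re-running the loader updates pages instead of duplicating them
        doc_id = "{}-{}".format(filename, page_num + 1)
        body = {"file": filename, "page": page_num + 1, "data": page_text}
        es.index(index=INDEX, doc_type="_doc", id=doc_id, body=body)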

from flask import Flask, jsonify, request,render_template
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch("http://localhost:9200/")

app = Flask(__name__)

@app.route('/pdf', methods=['GET'])
def index():
    results = es.get(index='pdfclearn', doc_type='_doc', id='44')
    return jsonify(results['_source'])


@app.route('/pdf/<id>', methods=['GET'])
def index_by_id(id):
    results = es.get(index='pdfclearn', doc_type='_doc', id=id)
    return jsonify(results['_source'])



@app.route('/search/<keyword>', methods=['POST','GET'])
def search(keyword):

    body = {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": ["data"]
            }
        }
    }

    res = es.search(index="pdfclearn", doc_type="_doc", body=body)

    return jsonify(res['hits']['hits'])

@app.route("/searhbar")
def searhbar():
    return render_template("index.html")

@app.route("/searhbar/<string:box>")
def process(box):
    query = request.args.get('query')
    if box == 'names':
         keyword = box

    body = {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": ["data"]
            }
        }
    }

    res = es.search(index="pdfclearn", doc_type="_doc", body=body)

    return jsonify(res['hits']['hits'])

if __name__ == '__main__':
    app.run(port=5003, debug=True)

In the above code we can search across all pages for a keyword or phrase.

curl http://127.0.0.1:5003/search/test   # it works!!
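To also show where a keyword matched, Elasticsearch's highlight option can be added to the same query body. A small sketch of a variant route (the /search_hl path, fragment size, and snippet count are arbitrary choices of mine, not something from the code above):

@app.route('/search_hl/<keyword>', methods=['GET'])
def search_with_snippets(keyword):
    body = {
        "query": {"multi_match": {"query": keyword, "fields": ["data"]}},
        # ask Elasticsearch to return matching snippets alongside each hit
        "highlight": {"fields": {"data": {"fragment_size": 150, "number_of_fragments": 3}}}
    }
    res = es.search(index="pdfclearn", doc_type="_doc", body=body)
    # each hit now carries a "highlight" key with the matching snippets
    return jsonify(res['hits']['hits'])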

I found a blog about how to save PDF files as a Base64 index in Elasticsearch. I have seen DocuSign's API do this for document templating. However, I don't understand how to JSONify the Base64 PDF in a way that's searchable for Elasticsearch.
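For reference, the usual way to make a Base64 PDF searchable inside Elasticsearch itself is the ingest-attachment plugin (installed with bin/elasticsearch-plugin install ingest-attachment), which runs Apache Tika over the encoded bytes at index time. A minimal sketch, assuming the plugin is installed; the pipeline and index names here are my own:

import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200/")

# one-time setup: a pipeline that runs the attachment processor on the "data" field
es.ingest.put_pipeline(id="pdf_attachment", body={
    "description": "extract text from base64-encoded PDFs",
    "processors": [{"attachment": {"field": "data", "indexed_chars": -1}}]
})

with open("Sheet3.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# the extracted text ends up in attachment.content, which is a searchable text field
es.index(index="pdf_attach", doc_type="_doc", id=1,
         body={"data": encoded}, pipeline="pdf_attachment")

The trade-off is that this treats the whole file as one document; for per-page hits, extracting text with PyPDF2 as in the progress code above stays the simpler route.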

curl "http://localhost:9200/pdftext/_doc/42"

curl -X POST "http://localhost:9200/pdf/_search?q=*"

I can retrieve the Base64 of a 700-page document, but I think what I need is to index and retrieve each page of the document.
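If the pages are indexed one document per page with a file name and page number (as in the folder-loop sketch earlier), retrieving a single page becomes a small bool query. A sketch against that assumed pdf_pages index; file.keyword relies on Elasticsearch's default dynamic mapping adding a keyword sub-field to the string:

# fetch one specific page of one specific file from the per-page index
body = {
    "query": {
        "bool": {
            "must": [
                {"term": {"file.keyword": "Sheet3.pdf"}},
                {"term": {"page": 7}}
            ]
        }
    }
}
res = es.search(index="pdf_pages", body=body)
for hit in res["hits"]["hits"]:
    print(hit["_source"]["data"])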

Blogs I have studied that got me part of the way:

Endgame:

I will continue to study Elasticsearch and Base64 encoding and decoding, but I would like some help getting to my goal. Any detailed example would be much appreciated.

  • Found a lib for Python, Whoosh: https://whoosh.readthedocs.io/en/latest/intro.html. I will try a new approach with this lib next. – BlackFox Feb 08 '20 at 22:46
  • Now testing a lib for Python, Scout: https://scout.readthedocs.io/en/latest/installation.html – BlackFox Feb 09 '20 at 06:56

4 Answers

------Progress------ I now have a working solution with no front-end search capability. The code is the same progress update already included in the question above: load a single PDF into Elasticsearch page by page, then use Flask routes to fetch pages by id and to search across all pages for a keyword or phrase.

– BlackFox

So I found a lib called Scout and... got it to work!

from scout_client import Scout

# import libraries to help read and create PDF

import PyPDF2
from fpdf import FPDF
import base64
import os
import requests
from flask import Flask, jsonify, request, render_template, json

client = Scout('http://localhost:8000')

for k in range(7,18):
    read_pdf = PyPDF2.PdfFileReader("books/%s.pdf"%(k))
    num = read_pdf.getNumPages()
    print ("PDF pages:", num)
    all_pages = []
    for page in range(num):
        data = read_pdf.getPage(page) 
        page_text = data.extractText()  
        all_pages.append(page_text)

    for z in all_pages:
        url = 'http://localhost:8000/documents/'
        data = {'content': z, 'indexes': ['test13']}
        headers = {
        'Content-Type': 'application/json',
        }

        response = requests.post(url, data=json.dumps(data), headers=headers)

    print(response)
  • I can now loop through as many PDFs as I want locally
  • Post to the server for indexing
  • and search for keywords (a sketch of a search request follows below)
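For the search step, Scout exposes its indexes over HTTP as well. A sketch of what that request could look like; the /<index>/?q= URL shape is my reading of the Scout docs linked above, so treat the endpoint and parameter names as assumptions to verify:

import requests

# assumed endpoint: search the 'test13' index created by the loader above
resp = requests.get('http://localhost:8000/test13/', params={'q': 'some keyword'})
print(resp.json())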

Now I just need help making a basic front end with a search bar that calls data from a JSON response in Python and Flask.
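For the search bar itself, here is one minimal sketch: a single-file example that reuses the Elasticsearch-backed /search/<keyword> route from the progress code above (the same fetch() call could just as easily point at a Scout-backed route):

from flask import Flask, jsonify, render_template_string
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200/")
app = Flask(__name__)

# inline template: a form whose submit handler calls the JSON search route with fetch()
PAGE = """
<form onsubmit="run(event)">
  <input id="q" placeholder="search the PDFs...">
  <button>Search</button>
</form>
<pre id="out"></pre>
<script>
async function run(e) {
  e.preventDefault();
  const q = document.getElementById('q').value;
  const res = await fetch('/search/' + encodeURIComponent(q));
  document.getElementById('out').textContent = JSON.stringify(await res.json(), null, 2);
}
</script>
"""

@app.route('/searchbar')
def searchbar():
    return render_template_string(PAGE)

@app.route('/search/<keyword>')
def search(keyword):
    body = {"query": {"multi_match": {"query": keyword, "fields": ["data"]}}}
    res = es.search(index="pdfclearn", doc_type="_doc", body=body)
    return jsonify(res['hits']['hits'])

if __name__ == '__main__':
    app.run(port=5003, debug=True)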

– BlackFox

So now Amazon has a solution for my use case: AWS Textract. If you create a free AWS account and install the CLI and Python SDK, you can use the following code:

import boto3

# Document
documentName = "test2-28.png"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = document.read()

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})


# print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')

Make sure to convert your PDF pages to images first; the ML works off images. I used .png files for each page. Next I will need to loop through a folder with all pages as images in it. I will also need to save the output to a CSV file or DB for future analysis.
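A sketch of that loop, assuming the pdf2image package (which needs the poppler utilities installed) for the PDF-to-PNG step; the input filename and output CSV name are placeholders:

import csv
import io

import boto3
from pdf2image import convert_from_path  # pip install pdf2image, requires poppler

textract = boto3.client('textract')

# render each PDF page to a PNG in memory, run Textract on it, and collect the detected lines
rows = []
for page_num, page in enumerate(convert_from_path("test2.pdf", dpi=300), start=1):
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    response = textract.detect_document_text(Document={'Bytes': buf.getvalue()})
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            rows.append({"page": page_num, "text": block["Text"]})

# save everything to CSV for later analysis
with open("textract_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["page", "text"])
    writer.writeheader()
    writer.writerows(rows)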

– BlackFox
  • Is it possible to create a searchable PDF from what AWS Textract returns (using Python)? Amazon has shared Java code, but I need it in Python as that is what I am somewhat familiar with. The link to the Java code: https://aws.amazon.com/blogs/machine-learning/generating-searchable-pdfs-from-scanned-documents-automatically-with-amazon-textract/ Thanks!! – jim70 Feb 17 '21 at 02:32
  • @jim70 Explain how in detail, post it, and I'll give you the answer credit here. I did explore the solution at one point after the project was over... I do think it's the best solution on the market right now, but it costs. – BlackFox Feb 18 '21 at 18:33
  • I wish I could. I mean I will have to learn Java first (to translate what AWS has shown can be done in Java). But I have learnt that what Amazon returns, or what the Google Vision API can return, can be converted into hOCR format, and then the hOCR and the original PDF can be merged using Python libraries. But I have not been able to do so. Wish I could. – jim70 Feb 19 '21 at 20:17
  • Updated: looks like GCP has a great solution for PDFs now: https://cloud.google.com/vision/docs/pdf. It is completely what I'm looking for and has examples in many languages :) – BlackFox Aug 13 '22 at 22:32

Try this: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/binary.html

Use store=true for this datatype, since by default a binary field is not stored and so cannot be returned with search results.
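A sketch of what that mapping could look like through the Python client, assuming Elasticsearch 6.8 as in the linked docs. Note that store: true only makes the binary field retrievable; it is still not full-text searchable, so the extracted page text needs its own text field:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200/")

# keep the raw base64 PDF retrievable (store: true) and put the searchable
# extracted text in a separate text field
es.indices.create(index="pdf_binary", body={
    "mappings": {
        "_doc": {
            "properties": {
                "data": {"type": "binary", "store": True},
                "text": {"type": "text"}
            }
        }
    }
})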

– Pankhuri Agarwal
  • Did you try this method? – Pankhuri Agarwal Feb 17 '20 at 06:15
  • Success? Failure? Time invested? – Pankhuri Agarwal Feb 17 '20 at 06:17
  • If you use Elasticsearch, a link to the docs is the easiest thing! – Pankhuri Agarwal Feb 17 '20 at 06:27
  • You're not supposed to do that on Stack Overflow. You have to outline a detailed solution in context; you're not allowed to just post a link, lol. I know it sucks, but it helps the community... like with my question here. I needed an example of how to use what was in your link and how to create a loop through the pages of a PDF. In my research I found a lib called PyPDF2... see how I use more words instead of just saying "here's a link to one of the dependencies you're using in your code" :) – BlackFox Feb 17 '20 at 06:42