Improving accuracy in Python Tesseract OCR

Question

I am using pytesseract with OpenCV in a simple Django application to extract Bengali text from image files. A form lets the user upload an image; clicking the submit button sends it to the server in a jQuery AJAX call, and the server runs OCR (Optical Character Recognition) on it to extract the text.

Template:

 <div style="text-align: center;">
 <div id="result" class="text-center"></div>
    <form enctype="multipart/form-data" id="ocrForm" action="{% url 'process_image' %}" method="post"> <!-- Do not forget to add: enctype="multipart/form-data" -->
        {% csrf_token %}
        {{ form }}
        <button type="submit" class="btn btn-success">OCRzed</button>
    </form>

    <br><br><hr>
    <div id="content" style="width: 50%; margin: 0 auto;">
        
    </div>
</div>


<script type="text/javascript">
    $(document).ready(function(){
        function submitFile(){
            var fd = new FormData();
            fd.append('file', getFile())
            $("#result").html('<span class="wait">Please wait....</span>');

            $('#content').html('');
            $.ajax({
                url: "{% url 'process_image' %}",
                type: "POST",
                data: fd,
                processData: false,
                contentType: false,
                success: function(data){
                    // console.log(data.content);

                    $("#result").html('');

                    if(data.content){
                        $('#content').html(
                            "<p>" + data.content + "</p>"
                        )
                    }  
                }
            })
        }

        function getFile(){
            var fp = $("#file_id")
            var item = fp[0].files
            return item[0]
        }

        // Submit the file for OCRization
        $("#ocrForm").on('submit', function(event){
            event.preventDefault();
            submitFile()
        })
    });
</script>

The urls.py file has:

from django.urls import path, re_path
from .views import *

urlpatterns = [
    path('process_image', OcrView.process_image, name='process_image'),
]

The view:

from django.contrib.auth.models import User
from django.shortcuts  import render, redirect, get_object_or_404
from .forms import NewTopicForm
from .models import Board, Topic, Post
from django.shortcuts import render
from django.http import HttpResponse
from django.http import Http404
    
from django.http import JsonResponse
from django.views.generic import FormView
    
from django.views.decorators.csrf import csrf_exempt
import json
import cv2
import numpy as np
    
import pytesseract
try:
    from PIL import Image
except ImportError:
    import Image

def ocr(request):
    return render(request, 'ocr.html')
    #    {'board': board,'form':form})    

# get grayscale image
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# noise removal
def remove_noise(image):
    return cv2.medianBlur(image, 5)

# thresholding
def thresholding(image):
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# dilation
def dilate(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.dilate(image, kernel, iterations=1)

# erosion
def erode(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.erode(image, kernel, iterations=1)

# opening - erosion followed by dilation
def opening(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

# canny edge detection
def canny(image):
    return cv2.Canny(image, 100, 200)

# skew correction
def deskew(image):
    coords = np.column_stack(np.where(image > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return rotated

# template matching
def match_template(image, template):
    return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
 
class OcrView(FormView):
    form_class = UploadForm
    template_name = 'ocr.html'
    success_url = '/'

    
    @csrf_exempt
    def process_image(request):
        if request.method == 'POST':
          response_data = {}
          upload = request.FILES['file']
        
        filestr = request.FILES['file'].read()
        #convert string data to numpy array
        npimg = np.fromstring(filestr, np.uint8)
        image = cv2.imdecode(npimg, cv2.IMREAD_UNCHANGED)

        # image=Image.open(upload)
        gray = get_grayscale(image)
        thresh = thresholding(gray)
        opening1 = opening(gray)
        canny1 = canny(gray)
       
        pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
        # content = pytesseract.image_to_string(Image.open(upload), lang = 'ben')

        # content = pytesseract.image_to_string( image, lang = 'ben')

        content = pytesseract.image_to_string( image, lang = 'eng+ben')

        #   data_ben = process_image("test_ben.png", "ben")
        response_data['content'] = content

        return JsonResponse(response_data)

Below is a sample image. When I use it as the input file, the extracted text is far from accurate. The input image is:

Below is a screenshot of the extracted text, with the wrong words underlined in red. Note that spaces and indentation are not preserved. The screenshot of the extracted text is:

In the code above, I do the image processing with the following lines:

gray = get_grayscale(image)
thresh = thresholding(gray)
opening1 = opening(gray)
canny1 = canny(gray)

After that, I feed the image to Tesseract in the following line:

content = pytesseract.image_to_string( image, lang = 'eng+ben')

But my point of confusion is that I never save the image, before or after processing. So when I use the line above, I am not sure whether the processed or the unprocessed image is supplied to the Tesseract engine.

Q1) Do I need to save the image after processing it and then supply it to the Tesseract engine? If yes, how do I do that?

Q2) What other steps should I take to improve the accuracy?

NB: Even if you are not familiar with Bengali, this should not be a problem, as you can just look at the red-underlined words and compare.

EDIT:

TL;DR: To understand the problem quickly, you can just look at the code in views.py and urls.py and skip the template code.

Did you find the solution? – Gary Chen, Apr 26 '21 at 13:56

score 1 · Answer 1 · answered Nov 20 '21 at 20:36

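Q1) No need to save the image. The image is held in memory in your variable image, and each processing function returns its result as a new in-memory array that can be passed to pytesseract directly.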

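Q2) You are not actually running OCR on the image that the processing functions were applied to, i.e. the variable canny1; you pass the original image instead. In addition, each step in your code is applied to gray rather than to the output of the previous step. The code below performs the processing steps successively and then applies OCR to the post-processed image stored in canny1.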

gray = get_grayscale(image)
thresh = thresholding(gray)
opening1 = opening(thresh)
canny1 = canny(opening1)

content = pytesseract.image_to_string(canny1, lang='eng+ben')
