I would like to remove gridlines from scanned documents using Python to make the documents easier to read.

Here is a snippet of what we're working with:

[Image: an example scan]

As you can see, there are inconsistencies in the grid, and to make matters worse the scanning isn't always square. Five example documents can be found here.

I am open to whatever methods you may suggest for this, but OpenCV and pypdf might be a good place to start before breaking out more involved machine learning techniques.

This post addresses a similar question, but does not have a solution. The user posted the following code snippet, which may be of interest (to be honest, I have not tested it; I am just putting it here for your convenience).

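import cv2
import numpy as np

def rmv_lines(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)
    # Pass these by keyword: the fifth positional argument of HoughLinesP
    # is the optional output array, not minLineLength.
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 100,
                            minLineLength=100, maxLineGap=15)
    # HoughLinesP returns None when it finds no segments.
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            # if x1 != x2 and y1 != y2:  (disabled in the original snippet)
            # Paint over each detected segment with a 4 px white line.
            cv2.line(img, (int(x1), int(y1)), (int(x2), int(y2)), (255, 255, 255), 4)
    return cv2.imwrite('removed.jpg', img)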

I would prefer the final documents be in pdf format if possible.

desertnaut
Alex Long
  • Please be aware this is not a code-writing or tutoring service. We can help solve specific, technical problems, not open-ended requests for code or advice. Please edit your question to show what you have tried so far, and what specific problem you need help with. See the [How To Ask a Good Question](https://stackoverflow.com/help/how-to-ask "How To Ask a Good Question") page for details on how to best help us help you. – itprorh66 Feb 25 '21 at 00:02
  • Fair enough, should I delete this post? I am rather well versed in python, but not in image processing, so I was hoping I could get some assistance along those lines. – Alex Long Feb 25 '21 at 00:17

1 Answer


(disclaimer: I am the author of pText, the library being used in this answer)

I can help you part of the way (extracting the images from the PDF).

Start by loading the Document. You'll see that I'm passing an extra parameter in the PDF.loads method. SimpleImageExtraction acts like an EventListener for PDF instructions. Whenever it encounters an instruction that would render an image, it intercepts the instruction and stores the image.

with open(file, "rb") as pdf_file_handle:
    l = SimpleImageExtraction()
    doc = PDF.loads(pdf_file_handle, [l])
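To make the listener idea concrete, here is a toy sketch of the general pattern in plain Python. It is not pText's implementation: `ImageCollector`, `process`, the `on_event` method, and the tuple-based instruction format are all hypothetical names invented for illustration. Only `get_images_per_page` mirrors a method name that appears in this answer.

```python
class ImageCollector:
    """Hypothetical listener that keeps every image it is told about, per page."""

    def __init__(self):
        self._images = {}

    def on_event(self, page_nr, operator, payload):
        # Only image-drawing instructions are of interest; ignore the rest.
        if operator == "draw_image":
            self._images.setdefault(page_nr, []).append(payload)

    def get_images_per_page(self, page_nr):
        return self._images.get(page_nr, [])


def process(instructions, listeners):
    """Toy interpreter: walk the instructions and notify every listener."""
    for page_nr, operator, payload in instructions:
        for listener in listeners:
            listener.on_event(page_nr, operator, payload)


# Made-up instruction stream: (page number, operator, payload).
instructions = [
    (0, "show_text", "Invoice"),
    (0, "draw_image", "scan_a"),
    (1, "draw_image", "scan_b"),
    (0, "draw_image", "scan_c"),
]

collector = ImageCollector()
process(instructions, [collector])
print(collector.get_images_per_page(0))  # ['scan_a', 'scan_c']
```

The point of the pattern is that the interpreter does not need to know what each listener does with an instruction, so image extraction can be added without changing the parsing code.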

Now that the Document is loaded and SimpleImageExtraction has had a chance to work its magic, we can output the images. In this example I'm just going to store them.

    for i, img in enumerate(l.get_images_per_page(0)):
        output_file = "image_" + str(i) + ".jpg"
        with open(output_file, "wb") as image_file_handle:
            img.save(image_file_handle)
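Since the question asks for PDF output, the images could also be written back into a single PDF once they have been cleaned. The sketch below uses Pillow for that and assumes the images are (or can be converted to) `PIL.Image` objects, which this answer does not confirm. The helper name `images_to_pdf` and the blank demo pages are made up for illustration.

```python
from io import BytesIO

from PIL import Image


def images_to_pdf(images, target):
    """Write a sequence of PIL images to `target` as a multi-page PDF.

    Hypothetical helper; `target` may be a path or a binary file object.
    """
    # Convert to RGB so every page uses a mode the PDF writer accepts.
    pages = [image.convert("RGB") for image in images]
    if not pages:
        raise ValueError("need at least one image")
    pages[0].save(target, format="PDF", save_all=True, append_images=pages[1:])


# Two blank stand-in pages (assumed data, in place of real cleaned scans).
demo_pages = [
    Image.new("RGB", (200, 100), "white"),
    Image.new("L", (200, 100), 255),
]

buffer = BytesIO()
images_to_pdf(demo_pages, buffer)
pdf_bytes = buffer.getvalue()
print(len(pdf_bytes) > 0)
```

Passing a filename such as `"cleaned.pdf"` instead of the in-memory buffer would write the result to disk.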

You can obtain pText either on GitHub or from PyPI. There are a ton more examples; check them out to find out more about working with images.

Joris Schellekens