I would like to remove gridlines from a scanned document using Python to make them easier to read.
Here is a snippet of what we're working with:
As you can see, there are inconsistencies in the grid, and to make matters worse the scanning isn't always square. Five example documents can be found here.
I am open to whatever methods you may suggest for this, but using openCV and pypdf might be a good place to start before any more involved breaking out the machine learning techniques.
This post addresses a similar question, but does not have a solution. The user posted the following code snippet which may be of interest (to be honest I have not tested it, I am just putting it here for your convivence).
import cv2
import numpy as np
def rmv_lines(Image_Path):
img = cv2.imread(Image_Path)
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray,50,150,apertureSize = 3)
minLineLength, maxLineGap = 100, 15
lines = cv2.HoughLinesP(edges,1,np.pi/180,100,minLineLength,maxLineGap)
for x in range(0, len(lines)):
for x1,y1,x2,y2 in lines[x]:
#if x1 != x2 and y1 != y2:
cv2.line(img,(x1,y1),(x2,y2),(255,255,255),4)
return cv2.imwrite('removed.jpg',img)
I would prefer the final documents be in pdf format if possible.