I am trying to run OCR (Optical Character Recognition) on a large number of documents of the same type, and I use the pdf2image library for Python to convert them to images. When it processes PDFs that contain AutoCAD SHX text, it also renders the bounding boxes around the text. The boxes are not visible in the PDF at first; you need to click on the text to see them. But they do appear in the JPG output after conversion.
- Here's an image of part of the PDF document: crop from actual pdf
- And here's the output of the conversion: pdf after conversion
I expect something like this: pdf after conversion how it should be
Here's my function for pdf2image conversion:
import numpy as np
from pdf2image import convert_from_bytes
from PIL import ImageEnhance


def get_image_for_blueprint(path, dpi):
    """Create image for full blueprint from path
    path: path to pdf file
    dpi: resolution used when rendering the pdf
    """
    with open(path, 'rb') as f:
        images = convert_from_bytes(f.read(), dpi=dpi)  # Actual conversion function
    image = None  # stays None if the pdf has no landscape page
    for i in images:
        width, height = i.size
        if width < height:
            continue  # skip portrait pages
        print(i.size)  # tuple : (width, height)
        image = i.resize((4096, int(height / width * 4096)))
        enhancer = ImageEnhance.Sharpness(image)
        image = enhancer.enhance(2)  # Sharpness
        enhancer = ImageEnhance.Color(image)
        image = enhancer.enhance(0)  # black and white
        enhancer = ImageEnhance.Contrast(image)
        image = enhancer.enhance(2)  # Contrast
        image = np.asarray(image)  # array
        image = image.astype(np.uint8)
    return image  # the last landscape page, as a uint8 array
I found solutions that have to be applied in AutoCAD before saving the document, but I cannot find a way to get rid of these boxes when I only have the PDF (in Python or C++).
Maybe this can be solved with a library for another programming language or with additional software.
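One direction that might be worth trying, sketched here under an unverified assumption: that the boxes are stored in the PDF as annotations (comments attached to the SHX text) rather than as part of the page content. If that holds, stripping each page's "/Annots" entry before conversion should make them disappear. The sketch below uses the third-party pypdf package, which is not part of the original setup, and the function names `drop_annotations` and `strip_pdf_annotations` are made up for illustration.

```python
def drop_annotations(page):
    """Delete the "/Annots" entry from a page dictionary, if present.

    Works on any dict-like page object (pypdf pages behave like dicts).
    Returns True when an entry was removed, False otherwise.
    """
    if "/Annots" in page:
        del page["/Annots"]
        return True
    return False


def strip_pdf_annotations(src_path, dst_path):
    """Write a copy of src_path to dst_path without any page annotations.

    Assumption: the unwanted boxes are PDF annotations. Note that this
    removes ALL annotations on every page, including links and comments.
    Returns the number of pages that had annotations.
    """
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(src_path)
    writer = PdfWriter()
    pages_changed = 0
    for page in reader.pages:
        if drop_annotations(page):
            pages_changed += 1
        writer.add_page(page)
    with open(dst_path, "wb") as f:
        writer.write(f)
    return pages_changed
```

The cleaned file could then be passed to get_image_for_blueprint in place of the original path. If strip_pdf_annotations returns 0 for a file that still shows the boxes, the assumption is wrong for that file and the boxes are probably drawn as regular page content.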