I am trying to run OCR (Optical Character Recognition) on a large number of documents of the same type, using the pdf2image library for Python. When it converts PDFs that contain AutoCAD SHX text, the bounding boxes around the text are rendered as well. The boxes are not visible in the PDF at first; you have to click on the text to see them. They do appear in the JPG result after conversion, though.

I expect something like this: (image: the PDF after conversion, as it should look)

Here's my function for pdf2image conversion:

import numpy as np
from pdf2image import convert_from_bytes
from PIL import ImageEnhance


def get_image_for_blueprint(path, dpi):
    """Return the first landscape page of a PDF as a preprocessed uint8 array.

    path: path to pdf file
    dpi: rendering resolution passed to pdf2image
    """
    with open(path, 'rb') as f:
        images = convert_from_bytes(f.read(), dpi=dpi)  # Actual conversion function
    for i in images:
        width, height = i.size
        if width < height:
            continue  # skip portrait pages
        print(i.size)  # tuple: (width, height)
        image = i.resize((4096, int(height / width * 4096)))
        image = ImageEnhance.Sharpness(image).enhance(2)  # Sharpness
        image = ImageEnhance.Color(image).enhance(0)  # black and white
        image = ImageEnhance.Contrast(image).enhance(2)  # Contrast
        return np.asarray(image).astype(np.uint8)  # array
    return None  # no landscape page found

I found solutions that have to be applied in AutoCAD before saving the document, but I cannot find a way to get rid of these boxes when I only have the PDF (in Python or C++).

Maybe it can be solved with a library for another programming language, or with additional software.
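One direction that might be worth trying, sketched here under an unverified assumption: the boxes may be stored in the PDF as annotations (comments) rather than as page content, which would explain why they only show up when the text is clicked. If that is the case, deleting each page's /Annots entry before calling pdf2image could keep them out of the rendered image. The helper names strip_annotations and remove_pdf_annotations are hypothetical, and the sketch assumes the pypdf package is installed; it has not been tested on AutoCAD output.

```python
def strip_annotations(pages):
    """Delete the /Annots entry from every page dictionary.

    Works on any iterable of dict-like page objects.
    Returns the number of pages that had annotations removed.
    """
    changed = 0
    for page in pages:
        if "/Annots" in page:
            del page["/Annots"]
            changed += 1
    return changed


def remove_pdf_annotations(src_path, dst_path):
    """Write a copy of src_path with all page annotations removed.

    Assumption: the unwanted boxes are PDF annotations. If they are
    drawn as ordinary page content, this will not remove them.
    """
    from pypdf import PdfReader, PdfWriter  # third-party, assumed installed

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    changed = strip_annotations(writer.pages)
    with open(dst_path, "wb") as f:
        writer.write(f)
    return changed
```

The cleaned copy could then be passed to get_image_for_blueprint in place of the original file. Note that this removes every annotation on the page, not only the text boxes.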

dzhu_man_dzhi
ArsenK