I have working code that uses AWS Textract to perform OCR on PDFs, and I generally have no alignment issues. In a recent test document, however, the redactions show up rotated exactly 90 degrees relative to the PDF page image.
So far I've been analyzing the Textract JSON to see whether it includes any information about page orientation, but I can't find anything. Is there some mechanism for identifying whether Textract's bounding box info is rotated?
[Image: sample showing page snippet vs. redactions]
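One check I could run on the JSON is sketched below. It assumes that each block's `Geometry["Polygon"]` lists its points clockwise starting from the text's own top-left corner, so the first two points trace the top edge of the text as it is read; I have not confirmed that this holds for every document. The function names are hypothetical and not part of my actual code. Because the coordinates are normalized separately by page width and height, the raw angle is only approximate, so the sketch snaps it to the nearest quarter turn and takes a majority vote over all LINE blocks.

```python
import math
from collections import Counter


def line_angle_degrees(polygon):
    """Angle of the edge from polygon[0] to polygon[1], in degrees.

    Assumption: the first two points are the top edge of the text as read.
    Image Y grows downward, so a positive angle means a clockwise turn.
    """
    p0, p1 = polygon[0], polygon[1]
    dx = p1["X"] - p0["X"]
    dy = p1["Y"] - p0["Y"]
    return math.degrees(math.atan2(dy, dx))


def nearest_quarter_turn(angle):
    """Snap an angle in degrees to 0, 90, 180 or 270."""
    return (int(round(angle / 90.0)) % 4) * 90


def dominant_rotation(blocks):
    """Most common quarter-turn rotation among LINE blocks, or None if there are none."""
    votes = Counter(
        nearest_quarter_turn(line_angle_degrees(b["Geometry"]["Polygon"]))
        for b in blocks
        if b.get("BlockType") == "LINE"
    )
    return votes.most_common(1)[0][0] if votes else None
```

Calling `dominant_rotation(ocrDict["Blocks"])` for the problem page and for a page that aligns correctly should show whether the two differ by a quarter turn.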
[Edit] Here's the OCR code at present:
for page in doc:
    fileTotalPages += 1  # increment page count
    myText = page.get_text().encode("utf8")
    page.wrap_contents()
    # render the page to a JPEG and decode it to get the image dimensions
    pix = page.get_pixmap()
    page_jpg = pix.tobytes(output='jpg')
    img = np.asarray(bytearray(page_jpg), dtype="uint8")
    img = cv2.imdecode(img, 0)
    iHeight, iWidth = img.shape[:2]
    # hide existing text by writing a full-page text redaction
    page.add_redact_annot(page.rect, fill=None, text="", text_color=None)
    page.apply_redactions(images=PDF_REDACT_IMAGE_NONE)
    ocrDict = ocr_page_textract(page, page.number)
    for item in ocrDict["Blocks"]:
        if item["BlockType"] == "LINE" or item["BlockType"] == "WORD":
            ocrText = item["Text"]
            ocrConf = item["Confidence"]
            geo = item["Geometry"]
            box = geo["BoundingBox"]
            # BoundingBox values are ratios of the image size; scale to pixels
            x0 = box["Left"] * iWidth  # left edge
            y0 = box["Top"] * iHeight  # top edge
            height = box["Height"] * iHeight
            width = box["Width"] * iWidth
            x1 = x0 + width  # right edge
            y1 = y0 + height  # bottom edge
            matrix = fitz.Rect(0, 0, 1, 1).torect(page.rect)
            ocrRect = fitz.Rect(x0, y0, x1, y1)
            bbox = ocrRect * matrix
            # scale the font so the text spans the width of the OCR rectangle
            textLen = font.text_length(ocrText, fontsize=1)
            fontSize = ocrRect.width / textLen
            # write the OCR text as an invisible layer (render_mode 3)
            page.insert_text(ocrRect.bl,
                             ocrText,
                             fontsize=fontSize,
                             fontname="helv",
                             render_mode=3)
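To make the coordinate math easier to check in isolation, here is the same scaling pulled out into a standalone function. This is only a sketch: `bbox_to_pixels` is a hypothetical name that does not appear in my actual code, and it assumes the `BoundingBox` values are ratios of the rendered image's width and height, as in the loop above. It does not account for any page rotation.

```python
def bbox_to_pixels(box, img_width, img_height):
    """Convert a Textract BoundingBox (ratios) to (x0, y0, x1, y1) in pixels."""
    x0 = box["Left"] * img_width
    y0 = box["Top"] * img_height
    x1 = x0 + box["Width"] * img_width
    y1 = y0 + box["Height"] * img_height
    return x0, y0, x1, y1


# Example with made-up values: a box covering the middle half of the page width
example = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}
print(bbox_to_pixels(example, 600, 800))
```

Comparing these pixel rectangles against the page dimensions for the problem document and for a working one is one way to see whether width and height appear swapped.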