
I have working code that uses AWS Textract to perform OCR in PDFs, and generally have no issues with alignment. But in a recent test document, the redactions performed show up exactly 90 degrees rotated in relation to the PDF image.

So far I've been trying to analyze the Textract JSON to see if it includes any info about page orientation, but I can't find anything. Is there some mechanism for identifying whether Textract's bounding box info is rotated?

[Image: sample showing page snippet vs. redactions]

[Edit] Here's the OCR code at present:

import cv2
import fitz  # PyMuPDF
import numpy as np

# doc, font, fileTotalPages and ocr_page_textract() are defined elsewhere in the script
for page in doc:
    fileTotalPages += 1  # increment page count
    myText = page.get_text().encode("utf8")
    page.wrap_contents()
    pix = page.get_pixmap()
    page_jpg = pix.tobytes(output="jpg")
    img = np.asarray(bytearray(page_jpg), dtype="uint8")
    img = cv2.imdecode(img, 0)  # decode as grayscale
    iHeight, iWidth = img.shape[:2]
    # hide existing text by writing a full-page text redaction
    page.add_redact_annot(page.rect, fill=None, text="", text_color=None)
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

    ocrDict = ocr_page_textract(page, page.number)
    for item in ocrDict["Blocks"]:
        if item["BlockType"] in ("LINE", "WORD"):
            ocrText = item["Text"]
            ocrConf = item["Confidence"]

            # Textract boxes are normalized (0..1); scale them to image pixels
            geo = item["Geometry"]
            box = geo["BoundingBox"]
            x0 = box["Left"] * iWidth  # left side
            y0 = box["Top"] * iHeight  # top side
            height = box["Height"] * iHeight
            width = box["Width"] * iWidth
            x1 = x0 + width  # right side
            y1 = y0 + height  # bottom side

            matrix = fitz.Rect(0, 0, 1, 1).torect(page.rect)
            ocrRect = fitz.Rect(x0, y0, x1, y1)
            bbox = ocrRect * matrix  # note: computed but not used below

            # scale the font so the text spans the width of the box
            textLen = font.text_length(ocrText, fontsize=1)
            fontSize = ocrRect.width / textLen

            # render_mode=3 writes invisible text
            page.insert_text(ocrRect.bl,
                             ocrText,
                             fontsize=fontSize,
                             fontname="helv",
                             render_mode=3)
REJ
  • what have you tried so far ? the question needs sufficient code for a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) – D.L Jun 29 '23 at 19:20
  • I realize this is fairly nebulous, but I'm not even sure I can provide much more detail since I've no idea what characteristic is causing this single page to be rendered incorrectly (all others in the same doc are correct). This is literally the first document I've tested where the redactions came through rotated. Mainly I'm hoping someone with better knowledge of Textract's JSON can provide a clue. – REJ Jun 29 '23 at 20:00
  • Update -- I think I've figured out the issue, but not the solution. This appears to be a Textract error based on its interpretation of image orientation. The other two pages in the document return width of 612 and height of 792, but the 'bad' page returns the opposite. I found some comments that Textract attempts to determine orientation automatically, so presumably it thinks the page is landscape and thus the coordinates are rotated 90 degrees. The grand question is how to deal with this in terms of redaction. – REJ Jun 30 '23 at 14:30
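On the question of identifying rotation from the JSON itself: each block's Geometry also carries a Polygon next to the BoundingBox. A minimal sketch, assuming the first two Polygon points run along the top edge of the text (so their direction follows the reading direction), would estimate each WORD's angle and take the most common multiple of 90 degrees. The function names here are hypothetical, and the sign convention (positive meaning clockwise on screen, since Y grows downward) should be checked against a page whose rotation is known.

```python
import math
from collections import Counter


def word_angle_degrees(block):
    """Angle of the block's top edge, taken from its first two Polygon points.

    Assumption: the Polygon starts at the text's own top-left corner and
    proceeds clockwise, so points 0 -> 1 follow the reading direction.
    """
    p0, p1 = block["Geometry"]["Polygon"][:2]
    return math.degrees(math.atan2(p1["Y"] - p0["Y"], p1["X"] - p0["X"]))


def page_orientation(blocks):
    """Most common WORD angle, snapped to 0, 90, 180 or 270 degrees."""
    snapped = [
        round(word_angle_degrees(b) / 90.0) * 90 % 360
        for b in blocks
        if b["BlockType"] == "WORD"
    ]
    if not snapped:
        return 0  # no words on the page: treat it as upright
    return Counter(snapped).most_common(1)[0][0]
```

Calling page_orientation(ocrDict["Blocks"]) for each page would then flag the page whose words run at 90 or 270 degrees.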

1 Answer


Check out the section "Page Orientation in Degrees" here: https://pypi.org/project/amazon-textract-response-parser/

In theory, the odd page should report a different orientation than the others, so you could use that value to rotate its bounding boxes back before inserting the text.

From the link:

from textractcaller.t_call import call_textract  # from the amazon-textract-caller package
from trp.t_pipeline import add_page_orientation
import trp.trp2 as t2
import trp as t1

# assign the Textract JSON dict to j: either call Textract as below,
# or use the JSON dict you already have
j = call_textract(input_document="path_to_some_document (PDF, JPEG, PNG)")
t_document: t2.TDocument = t2.TDocumentSchema().load(j)
t_document = add_page_orientation(t_document)

doc = t1.Document(t2.TDocumentSchema().dump(t_document))
# page orientation can be read now for each page
for page in doc.pages:
    print(page.custom['PageOrientationBasedOnWords'])
# you could then also dump this to a json response
ocrDict = t2.TDocumentSchema().dump(t_document)
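As for "doing something" with the orientation: one option is to rotate each normalized BoundingBox by a multiple of 90 degrees before scaling it to page coordinates. The following is only a sketch with a hypothetical helper name, rotate_box. It maps a box on a page image to the same box after that image is turned clockwise by the given angle. Whether a given page needs 90 or 270 depends on how the reported orientation is signed, which I have not verified, so test it on the misaligned page.

```python
def rotate_box(box, degrees):
    """Rotate a normalized Textract BoundingBox with its page image.

    `box` is a dict with Left, Top, Width, Height in the 0..1 range.
    `degrees` is the clockwise rotation applied to the page image and
    must be a multiple of 90. Returns a new dict; `box` is not modified.
    """
    left, top = box["Left"], box["Top"]
    width, height = box["Width"], box["Height"]
    turns = (degrees // 90) % 4
    if degrees % 90 != 0:
        raise ValueError("degrees must be a multiple of 90")
    if turns == 0:
        return {"Left": left, "Top": top, "Width": width, "Height": height}
    if turns == 1:  # point (x, y) moves to (1 - y, x)
        return {"Left": 1 - top - height, "Top": left,
                "Width": height, "Height": width}
    if turns == 2:  # point (x, y) moves to (1 - x, 1 - y)
        return {"Left": 1 - left - width, "Top": 1 - top - height,
                "Width": width, "Height": height}
    # turns == 3: point (x, y) moves to (y, 1 - x)
    return {"Left": top, "Top": 1 - left - width,
            "Width": height, "Height": width}
```

In the question's loop this would be applied to geo["BoundingBox"] before multiplying by the page width and height, using the unrotated page dimensions for the scaling.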
grantr