I used the OCR client Tesseract to generate OCR text and location data in an hOCR file. I'd like to create a PDF from the images with an invisible layer of Tesseract's text embedded in it, but I can't figure out how to do this. Generating a PDF without the text data is easy:

NSMutableData *pdfFile = [[NSMutableData alloc] init];
UIImage *image = [UIImage imageWithCGImage:[self.sourceImageArray[0] CGImage]];
CGRect rect;
// CGRectMake takes (x, y, width, height), so width must come before height.
rect = CGRectMake(0, 0, image.size.width, image.size.height);
UIGraphicsBeginPDFContextToData(pdfFile, CGRectZero, nil);
for (NSUInteger i = 0; i < [self.sourceImageArray count]; i++) {
    UIGraphicsBeginPDFPageWithInfo(rect, nil);
    UIImage *contextImage = self.sourceImageArray[i];
    [contextImage drawInRect:rect];
}
UIGraphicsEndPDFContext();
NSArray *paths = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES);
NSString *documentsDirectory = [paths objectAtIndex:0];
NSString* path = [documentsDirectory stringByAppendingPathComponent:@"multipage.pdf"];
NSData* data = pdfFile;
[data writeToFile:path atomically:YES];

In PDF source code, invisible text can be written using Text Rendering Mode 3 ('Neither fill nor stroke glyph shapes'). That's how OCR inserts its text into PDF pages which basically consist of only a scanned image.

So the question is: how can I render text to a PDF with Quartz in invisible mode 3? Any help would be really appreciated!

Nathaniel Waisbrot
M.R.

1 Answer

You cannot render text using render mode 3. What you can do is draw regular text on the page first and then draw the images on top of it. The images will cover the text, so it will not be visible. For text search operations there is no difference between rendering modes 0 and 3.
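
The text-first, image-second approach could look roughly like the sketch below. It is only an illustration: MakeSearchablePDF and the wordsPerPage structure (one array per page of dictionaries holding @"text" and an NSValue-wrapped @"rect") are hypothetical names, and it assumes the hOCR bounding boxes have already been parsed and converted to the page's point coordinates.

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper (not part of UIKit): builds a PDF in which each page
// holds the OCR words drawn first and the scanned image drawn on top.
// Assumption: wordsPerPage[i] is an array of dictionaries with
//   @"text" : NSString, the recognized word
//   @"rect" : NSValue wrapping a CGRect in page coordinates (top-left origin)
static NSData *MakeSearchablePDF(NSArray<UIImage *> *images,
                                 NSArray<NSArray<NSDictionary *> *> *wordsPerPage) {
    NSMutableData *pdfData = [NSMutableData data];
    UIGraphicsBeginPDFContextToData(pdfData, CGRectZero, nil);

    for (NSUInteger i = 0; i < images.count; i++) {
        UIImage *image = images[i];
        CGRect pageRect = CGRectMake(0, 0, image.size.width, image.size.height);
        UIGraphicsBeginPDFPageWithInfo(pageRect, nil);

        // 1. Draw the OCR text first so it ends up beneath the image.
        for (NSDictionary *word in wordsPerPage[i]) {
            NSString *text = word[@"text"];
            CGRect box = [word[@"rect"] CGRectValue];
            // Size the font from the box height so the text roughly
            // lines up with the word in the scan.
            UIFont *font = [UIFont systemFontOfSize:MAX(box.size.height, 1.0)];
            [text drawAtPoint:box.origin
               withAttributes:@{NSFontAttributeName : font}];
        }

        // 2. Draw the page image over the text. The image must be opaque,
        //    otherwise the text underneath will show through.
        [image drawInRect:pageRect];
    }

    UIGraphicsEndPDFContext();
    return pdfData;
}
```

The returned data can be written to disk with writeToFile:atomically: as in the question. Note that Core Graphics does declare a kCGTextInvisible value for CGContextSetTextDrawingMode; whether drawing with it produces render mode 3 in the generated PDF has not been verified here.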

Mihai Iancu