I used the OCR engine Tesseract to generate OCR text and location data in an hOCR file. I'd like to create a PDF from the images with an invisible layer of the Tesseract text embedded within it, but I can't figure out how to do this. Generating a PDF without the text data is easy:
NSMutableData *pdfFile = [[NSMutableData alloc] init];
UIImage *image = [UIImage imageWithCGImage:[self.sourceImageArray[0] CGImage]];
// Page size taken from the first image (width first, then height).
CGRect rect = CGRectMake(0, 0, image.size.width, image.size.height);

UIGraphicsBeginPDFContextToData(pdfFile, CGRectZero, nil);
for (NSUInteger i = 0; i < [self.sourceImageArray count]; i++) {
    // One PDF page per source image.
    UIGraphicsBeginPDFPageWithInfo(rect, nil);
    UIImage *contextImage = self.sourceImageArray[i];
    [contextImage drawInRect:rect];
}
UIGraphicsEndPDFContext();

// Write the finished PDF into the app's Documents directory.
NSArray *paths = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES);
NSString *documentsDirectory = [paths objectAtIndex:0];
NSString *path = [documentsDirectory stringByAppendingPathComponent:@"multipage.pdf"];
[pdfFile writeToFile:path atomically:YES];
In PDF source code, invisible text can be written using text rendering mode 3 ("neither fill nor stroke glyph shapes"). That's how OCR software inserts its text into PDF pages that essentially consist of nothing but a scanned image.
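From what I can tell, Quartz exposes the PDF text rendering modes through `CGContextSetTextDrawingMode()`, where `kCGTextInvisible` should correspond to mode 3. A rough sketch of the direction I imagine, drawn inside the same page context right after the image (here `words` and `OCRWord` are placeholders for however the hOCR output is modeled, and the mapping from hOCR pixel boxes to page points is assumed to be one-to-one):

```objectivec
CGContextRef ctx = UIGraphicsGetCurrentContext();

// Mode 3: glyphs are neither filled nor stroked, but remain selectable/searchable.
CGContextSetTextDrawingMode(ctx, kCGTextInvisible);
CGContextSelectFont(ctx, "Helvetica", 12.0, kCGEncodingMacRoman);

// Quartz's text space is flipped relative to UIKit's coordinate system.
CGContextSetTextMatrix(ctx, CGAffineTransformMakeScale(1.0, -1.0));

for (OCRWord *word in words) { // hypothetical model object parsed from the hOCR file
    const char *text = [word.text cStringUsingEncoding:NSMacOSRomanStringEncoding];
    if (text == NULL) continue;
    // Place the word at the baseline of its hOCR bounding box.
    CGContextShowTextAtPoint(ctx,
                             word.bbox.origin.x,
                             word.bbox.origin.y + word.bbox.size.height,
                             text,
                             strlen(text));
}
```

I'm not sure whether this is the right approach, or whether the invisible mode survives into the generated PDF the way mode 3 does in hand-written PDF source.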
So the question is: how can I render text to a PDF with Quartz in invisible mode 3? Any help would be really appreciated!