I have a TessBaseAPI() object that has already produced a result. I want to extract the words along with their bounding boxes, but I can't get it working.
val text = tesseract.getUTF8Text()
gives me the text.
val boxes = tesseract.words.boxRects
gives me bounding boxes that I can loop through, but they don't line up with the text from getUTF8Text().
Looping over the items returned by tesseract.getWords() and converting each item's data to a string gives me gibberish.
val words = tesseract.words
for (word in words) {
    // Logs gibberish instead of the recognized word
    Log.i(TAG, word.data.toString())
}
I found a really ugly workaround: call getHOCRText() and run regexes over the generated hOCR markup to pull out the text and the boxes.
// Requires: import java.util.regex.Pattern
val result = tesseract.getHOCRText(0)

// Bounding boxes: everything between "title='bbox " and "; x_wconf"
val boxPattern = Pattern.compile("(?<=title='bbox ).*?(?=; x_wconf)")
val boxMatch = boxPattern.matcher(result)
while (boxMatch.find()) {
    Log.i(TAG, boxMatch.group())
}

// Word text: everything between "'>" and the closing "</span>"
val textPattern = Pattern.compile("(?<='>).*?(?=</span>)")
val textMatch = textPattern.matcher(result)
while (textMatch.find()) {
    Log.i(TAG, textMatch.group())
}
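To show what I am after, the two loops above can be folded into a single pass so that each box stays paired with its text. This is only a rough sketch of the same workaround, not a proper solution: WordBox and parseWords are names I made up for illustration, and the sample string assumes the hOCR word spans look like title='bbox l t r b; x_wconf N'>text</span>, which is what my regexes already rely on.

```kotlin
// Hypothetical holder for one recognized word and its box (not part of tess-two).
data class WordBox(val text: String, val left: Int, val top: Int, val right: Int, val bottom: Int)

// Matches one word span: the four bbox numbers, then the span's inner text.
// Assumed layout: title='bbox l t r b; x_wconf N'>text</span>
private val WORD_SPAN =
    Regex("""title='bbox (\d+) (\d+) (\d+) (\d+); x_wconf[^']*'[^>]*>(.*?)</span>""")

// Returns every word found in the hOCR string, each paired with its own box.
fun parseWords(hocr: String): List<WordBox> =
    WORD_SPAN.findAll(hocr).map { match ->
        val (l, t, r, b, text) = match.destructured
        WordBox(text, l.toInt(), t.toInt(), r.toInt(), b.toInt())
    }.toList()

fun main() {
    // Made-up sample in the shape getHOCRText(0) is assumed to produce.
    val sample =
        "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 90'>Hello</span> " +
        "<span class='ocrx_word' id='word_1_2' title='bbox 109 92 174 116; x_wconf 87'>world</span>"
    for (word in parseWords(sample)) {
        println(word)
    }
}
```

This still depends on the exact hOCR attribute order, and any markup nested inside a word span would end up in the text field, so it is just as fragile as the original regexes.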
So, how can I properly extract each word's text together with its bounding box from tess-two?