I'm writing a PDF-to-text solution using OCR in Go.
The libraries I use are Gosseract and Go-Fitz.
The program works until I try to load an image from memory with Gosseract:
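import (
	"bytes"
	"image/jpeg"
	"log"
	"strings"

	// Import paths assumed; adjust them to the versions in your go.mod.
	"github.com/gen2brain/go-fitz"
	"github.com/otiai10/gosseract/v2"
)

// ProcessDoc renders each PDF page to an image and runs OCR on it.
func ProcessDoc(file []byte) (string, error) {
	var text strings.Builder

	client := gosseract.NewClient()
	defer client.Close()

	doc, err := fitz.NewFromMemory(file)
	if err != nil {
		log.Println(err)
		return "", err // was `return "", nil`, which hid the failure
	}
	defer doc.Close()

	for n := 0; n < doc.NumPage(); n++ {
		// Render page n to an in-memory image.
		img, err := doc.Image(n)
		if err != nil {
			log.Println(err)
			return "", err
		}

		// Encode the page as JPEG with the default quality.
		buf := new(bytes.Buffer)
		if err := jpeg.Encode(buf, img, nil); err != nil {
			log.Println(err)
			return "", err
		}

		// Hand the encoded bytes to Tesseract and read the text back.
		if err := client.SetImageFromBytes(buf.Bytes()); err != nil {
			return "", err
		}
		res, err := client.Text()
		if err != nil {
			return "", err
		}
		text.WriteString(res)
	}
	return text.String(), nil
}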
Then I get this error:
JPEG parameter struct mismatch: library thinks size is 624, caller expects 656
Error in pixReadStreamJpeg: internal jpeg error
Error in pixReadMemJpeg: pix not read
Error in pixReadMem: jpeg: no pix returned
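All four messages name the JPEG reading path (pixReadStreamJpeg, pixReadMemJpeg), so one way to narrow the problem down might be to hand Gosseract a PNG instead of a JPEG. The sketch below is a diagnostic idea rather than a fix, and it has not been tried against this setup. It uses only the standard library; `encodePNG` and `blankPage` are hypothetical names, and `blankPage` merely stands in for the image returned by `doc.Image(n)`.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png"
)

// encodePNG serializes a rendered page as PNG so that the consumer
// does not have to go through its JPEG decoder.
func encodePNG(img image.Image) ([]byte, error) {
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// blankPage is a hypothetical stand-in for the result of doc.Image(n).
func blankPage(w, h int) *image.RGBA {
	return image.NewRGBA(image.Rect(0, 0, w, h))
}

func main() {
	data, err := encodePNG(blankPage(4, 4))
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	// A PNG stream starts with the bytes 0x89 'P' 'N' 'G'.
	fmt.Printf("%d bytes, signature %q\n", len(data), data[1:4])
}
```

In `ProcessDoc`, the result of `encodePNG(img)` would be passed to `client.SetImageFromBytes` in place of the JPEG buffer. If OCR then succeeds, that would suggest the mismatch is confined to the libjpeg that Leptonica ends up calling, assuming Leptonica was built with PNG support.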
After a lot of searching, I learned that libleptonica and mupdf might be built against different versions of jpeglib.h. However, there is only one copy of that file on the whole system.
I should also note that I compiled libjpeg from source and then built libmupdf and libleptonica against that version of libjpeg to avoid any conflicts, but I still get the struct mismatch error.
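A single jpeglib.h on disk does not by itself show which libjpeg each binary actually links against, so it may help to inspect the dynamic dependencies directly. The following is a minimal sketch, assuming Linux ELF binaries; `jpegDeps` and `importedJPEG` are hypothetical helper names, and the paths to inspect (for example the compiled Go program and the Leptonica shared object) are supplied on the command line.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// jpegDeps keeps only the dependency names that look like a JPEG library.
func jpegDeps(libs []string) []string {
	var out []string
	for _, l := range libs {
		if strings.Contains(strings.ToLower(l), "jpeg") {
			out = append(out, l)
		}
	}
	return out
}

// importedJPEG lists the JPEG-related shared libraries that the ELF
// file at path declares as dynamic dependencies.
func importedJPEG(path string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	libs, err := f.ImportedLibraries()
	if err != nil {
		return nil, err
	}
	return jpegDeps(libs), nil
}

func main() {
	// Each argument is the path of a binary or shared object to inspect.
	for _, p := range os.Args[1:] {
		deps, err := importedJPEG(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, p, err)
			continue
		}
		fmt.Println(p, deps)
	}
}
```

If one binary lists a libjpeg shared object and another lists none, the second may carry a statically linked copy of libjpeg, which a search for jpeglib.h would not reveal. That is only one possible explanation for the size mismatch, not a confirmed diagnosis.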