I want to develop a tool that extracts the sickness dates from a certificate of disability. In Germany these certificates are standardized forms ("Arbeitsunfähigkeitsbescheinigung") that contain dates like this:
I tried using Tess4j and extracted the dates with tesseract.doOCR(filename, new Rectangle(300, 400, 200, 100)), but unfortunately these lines from the form are very often recognised as "1" or as "|". Also, the dots are sometimes detected and sometimes not. How do you usually treat this noise? Is it possible to fine-tune Tesseract on my own training data?
Please note that the doctor's printer is often not configured properly to print the dates inside the boxes of the form, so it's not possible to simply filter the lines out by their position.
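To make the kind of noise handling I am asking about more concrete, here is a minimal sketch of a text-level cleanup applied to the OCR result. It is only an illustration under assumptions: the class name `OcrDateCleanup` and the method `candidates` are hypothetical, and it assumes the dates are printed as dd.MM.yyyy (eight digits), which may not match every form. It drops everything that is not a digit (so "|" and the unreliable dots disappear), then tries leaving out surplus "1" characters until exactly eight digits remain that form a valid calendar date.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.LinkedHashSet;
import java.util.Set;

final class OcrDateCleanup {
    // Assumption: a date is dd.MM.yyyy, i.e. exactly 8 digits once separators are gone.
    private static final int DIGITS = 8;

    /** Returns every valid date obtainable by dropping non-digits and surplus '1' characters. */
    static Set<LocalDate> candidates(String ocrText) {
        // Remove '|', dots, spaces and anything else that is not a digit.
        String digits = ocrText.replaceAll("[^0-9]", "");
        Set<LocalDate> result = new LinkedHashSet<>();
        collect(digits, 0, new StringBuilder(), result);
        return result;
    }

    private static void collect(String digits, int pos, StringBuilder kept, Set<LocalDate> out) {
        if (kept.length() > DIGITS) {
            return; // already too long, no need to explore further
        }
        if (pos == digits.length()) {
            if (kept.length() == DIGITS) {
                int day = Integer.parseInt(kept.substring(0, 2));
                int month = Integer.parseInt(kept.substring(2, 4));
                int year = Integer.parseInt(kept.substring(4, 8));
                try {
                    out.add(LocalDate.of(year, month, day));
                } catch (DateTimeException ignored) {
                    // not a real calendar date, e.g. month 20
                }
            }
            return;
        }
        char c = digits.charAt(pos);
        // Option 1: keep the character as a genuine digit.
        kept.append(c);
        collect(digits, pos + 1, kept, out);
        kept.deleteCharAt(kept.length() - 1);
        // Option 2: a '1' may be a misread box line, so also try skipping it.
        if (c == '1') {
            collect(digits, pos + 1, kept, out);
        }
    }
}
```

For example, `OcrDateCleanup.candidates("|12.03.2018|")` yields the single date 2018-03-12. The obvious weakness is that a misread line and a real digit 1 look identical in the text, so several candidates can remain and would need an extra plausibility check (for instance, a date range). That is why I am also asking whether fine-tuning is the better route.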