52

I need an open OCR library which is able to scan complex printed math formulas (for example some formulas which were generated via LaTeX). I want to get some LaTeX-like output (or just some AST-like data).

Is there something like this already? Or are current OCR techniques only able to parse line-oriented text?
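To make "AST-like data" a bit more concrete, here is a minimal sketch of the kind of structure such a library could return, together with a renderer that turns it into LaTeX. The node types (`Sym`, `Frac`, `Sup`, `Seq`) and the `render` function are purely hypothetical names chosen for illustration; they do not come from any existing OCR tool.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Sym:
    name: str            # a single recognized symbol, e.g. "a" or "+"


@dataclass
class Frac:
    num: "Node"          # numerator subtree
    den: "Node"          # denominator subtree


@dataclass
class Sup:
    base: "Node"         # expression carrying the superscript
    exp: "Node"          # the superscript itself


@dataclass
class Seq:
    items: list          # subtrees laid out left to right


Node = Union[Sym, Frac, Sup, Seq]


def render(node: Node) -> str:
    """Turn the (hypothetical) AST back into a LaTeX string."""
    if isinstance(node, Sym):
        return node.name
    if isinstance(node, Frac):
        return r"\frac{" + render(node.num) + "}{" + render(node.den) + "}"
    if isinstance(node, Sup):
        return render(node.base) + "^{" + render(node.exp) + "}"
    if isinstance(node, Seq):
        return "".join(render(item) for item in node.items)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# (a+b) / c^2 as a tree
tree = Frac(Seq([Sym("a"), Sym("+"), Sym("b")]), Sup(Sym("c"), Sym("2")))
latex = render(tree)
```

Either form would be fine as output: the tree itself, or the string that `render` produces from it.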

(Note that I also posted this question on Metaoptimize because some people there might have additional knowledge.)

The problem was also described by OpenAI as im2latex.

Albert
  • Are your formulas handwritten or printed? – Jasper Aug 25 '10 at 21:17
  • Printed is fine for me now; handwritten might be too difficult anyway :) Though I guess an engine which is able to handle handwritten formulas will also be able to handle printed ones. – Albert Aug 25 '10 at 21:21
  • 1
    Have you found the solution? – yibotg Mar 11 '14 at 00:14
  • @tan9p: Unfortunately, no. I have seen several research projects over time, but none of them came with a nicely working final tool. Other than that, there are only the closed-source solutions listed in the answers. – Albert Mar 11 '14 at 11:33
  • 1
    you can use the mathpix API: https://mathpix.github.io/docs/ which supports handwritten / printed math and is free up to 2000 images per month. – nicodjimenez Dec 20 '16 at 21:26

10 Answers

28

SESHAT is an open-source system written in C++ for recognizing handwritten mathematical expressions. SESHAT was developed as part of a PhD thesis at the PRHLT research center at Universitat Politècnica de València.

An online demo: http://cat.prhlt.upv.es/mer/

The source: https://github.com/falvaro/seshat

Seshat is an open-source system for recognizing handwritten mathematical expressions. Given a sample represented as a sequence of strokes, the parser is able to convert it to LaTeX or other formats like InkML or MathML.

Slothworks
6

Check out "Web Equation." It can convert handwritten equations to LaTeX, MathML, or SymbolTree. I'm not sure if the engine is open source.

Geremia
6

According to the answers on Metaoptimize and the discussion on the Tesseract mailing list, there doesn't seem to be an open/free solution yet which can do that.

The only solution which seems to be able to do it (but I cannot verify this, as it is Windows-only and non-free) is, as a few other people have mentioned, the InftyProject.

Albert
  • 4
    InftyProject OCR (which is now located at http://www.inftyreader.org/?p=29 I believe) isn't that good :( http://img402.imageshack.us/img402/7875/testinftyproject.png – Franck Dernoncourt Oct 13 '12 at 20:01
6

InftyReader is the only one I'm aware of. It is NOT free software (it seems the money goes to a non-profit org, IIRC).

http://www.sciaccess.net/en/InftyReader/

I don't know why PDF can't have metadata in LaTeX? As in: put the LaTeX equation in it! Is this so hard? (I dunno anything about PDF syntax, but I imagine it can be done).

LaTeX syntax is THE ONE TRIED AND TRUE STANDARD for mathematics notation. It seems amazingly stupid that the folks who produced MathML and other stuff don't take this into consideration. InftyReader generates MathML or LaTeX syntax.

If I want HTML (pure) I then use TTH to read the LaTeX syntax. Just works.

ABBYY FineReader (a great OCR program) claims you can train the software for Math, but this is immensely braindead (who has the time?)

And Unicode has lots of math symbols. That today's OCR readers can't grok them shows the sorry state of software and the brain deficit in this activity.

As to "one symbol at a time", TeX obviously has rules as to where it will place symbols. They can't write software that knows those rules?! TeX is even public domain! They can just "use it" in their commercial products.

jjc
2

Considering that current technologies read one symbol at a time (see http://detexify.kirelabs.org/classify.html), I doubt there is an OCR for full mathematical equations.

Starkey
  • Yea, that is what I know about most engines. Though I hoped that there might be some progress on this. Anyway, wow, thanks for that link, quite interesting and useful! :) That will help me identify symbols in the future when I don't know what they are called or what they stand for, so I will get some text I can at least Google for! – Albert Aug 25 '10 at 21:24
2

Infty works fairly well. My former company integrated it into an application that reads equations out loud for blind people and is getting good feedback from users.

http://www.inftyproject.org/en/download.html

Yaroslav Bulatov
  • The download link seems broken. Also, is this open? It must be cross platform and in form of a library I can use. – Albert Aug 27 '10 at 14:22
  • Link works for me. I found it by googling "infty." It is not open and "mostly" commercial. Meaning, it's commercial, but it's developed and maintained by a group at a university who are sometimes open to working out a deal for non-profits. Out of all packages we evaluated, this one was the only that got above passable performance on math formulas, let me know if you find something better. – Yaroslav Bulatov Aug 27 '10 at 18:37
  • +1) Link works for me too, it is interesting indeed. Have you tested how it works to scan hand written mathematics (on a piece of paper) into LaTeX? – AD - Stop Putin - Oct 05 '12 at 11:39
  • I did not, but my gut feeling is that accuracy will be too poor to be usable on hand written mathematics. – Yaroslav Bulatov Oct 05 '12 at 20:23
1

Since the output from math OCR for complex formulas will likely have bugs -- even humans have trouble with it -- you will have to proofread the results, at least if they matter. The (human) proofreader will then have to correct the results, meaning you need to have a math formula editor. Given the human effort needed anyway and the probably limited corpus of complex formulas, you might find it easier to assign the whole task to humans.

As a research problem, reading math via OCR is fun -- you need a formalism for 2-D grammars plus a symbol recognizer.
In addition to references already mentioned here, why not google for this? There is work that was done at Caltech, Rochester, U. Waterloo, and UC Berkeley. How much of it is ready to use out of the box? Dunno.
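To illustrate the "2-D grammar plus symbol recognizer" idea, here is a toy sketch. It assumes a symbol recognizer has already produced labeled bounding boxes (the `Symbol` class, the `relation` thresholds, and `to_latex` are all hypothetical names and values made up for this example) and only decides whether each symbol sits inline, as a superscript, or as a subscript. A real system would need a proper 2-D grammar to handle fractions, roots, matrices, and nesting.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    latex: str   # label from a (hypothetical) symbol recognizer
    x: float     # left edge of the bounding box
    y: float     # top edge (y grows downward, as in image coordinates)
    w: float     # box width
    h: float     # box height

    @property
    def center_y(self) -> float:
        return self.y + self.h / 2


def relation(base: Symbol, other: Symbol) -> str:
    """Classify `other` relative to `base` as 'sup', 'sub', or 'inline'.

    The 0.8 / 0.3 / 0.7 thresholds are arbitrary illustrative values.
    """
    smaller = other.h < 0.8 * base.h
    if smaller and other.center_y < base.y + 0.3 * base.h:
        return "sup"
    if smaller and other.center_y > base.y + 0.7 * base.h:
        return "sub"
    return "inline"


def to_latex(symbols: list) -> str:
    """Scan symbols left to right and attach scripts to the last inline symbol."""
    out = []
    base = None
    pending = {"sub": [], "sup": []}

    def flush() -> None:
        # Emit collected scripts for the current base, subscript first.
        for key, mark in (("sub", "_"), ("sup", "^")):
            if pending[key]:
                out.append(mark + "{" + "".join(pending[key]) + "}")
                pending[key] = []

    for s in sorted(symbols, key=lambda sym: sym.x):
        rel = "inline" if base is None else relation(base, s)
        if rel == "inline":
            flush()
            out.append(s.latex)
            base = s
        else:
            pending[rel].append(s.latex)
    flush()
    return "".join(out)


boxes = [
    Symbol("x", 0, 10, 10, 10),
    Symbol("2", 10, 5, 5, 6),     # small and raised: superscript of x
    Symbol("+", 18, 10, 10, 10),
    Symbol("y", 30, 10, 10, 12),
    Symbol("i", 41, 18, 4, 6),    # small and lowered: subscript of y
]
result = to_latex(boxes)
```

Even this tiny version shows why the problem is harder than line-oriented OCR: the output depends on relative size and position, not just on the left-to-right order of the symbols.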

Richard Fateman
1

As of August 2019, there are a few options, depending on what you need. For converting printed math equations/formulas to LaTeX, Mathpix is absolutely the best choice. It's free. For converting handwritten math to LaTeX or to printed math, MyScript is the best option, although its app costs a few dollars.

0

There is this great short video: http://www.youtube.com/watch?v=LAJm3J36tLQ explaining how you can train your FineReader to recognize math formulas. If you use FineReader already, it is better to stick with one tool. Of course, it is not freeware :(

mPrinC
  • That is not really what I was asking about. I meant complex formulas - not line-based text. That is the whole point of the question and the tricky part which makes it different from traditional OCR like FineReader. – Albert Nov 25 '12 at 11:30
0

You know, there's an application in Win7 just for that: Math Input Panel. It even handles handwritten input (it's actually made for this). Give it a shot if you have Win7, it's free!

Blindy