
Goals:

1. Develop a canonical method to use plain text to uniquely represent STEM papers in general and math papers in particular.

2. Develop software that can convert existing typed STEM papers into that canonical form with 100% accuracy. I can't tolerate any inaccuracy: as a single individual, I can't proofread millions of papers to correct conversion errors, even at an average rate of 0.001 errors per paper.

Problems:

1. All the PDF-to-text, TeX-to-text, etc. programs I have seen here on Stack Overflow and elsewhere, such as PyMuPDF, do not really work, because they cannot process math symbols (see the first sketch after this list).

2. PDF is really hard to process.

3. TeX is really hard to process because of the numerous macros STEM authors tend to add to their source files, which tend to break LaTeXML and other converters. My own papers are easy to process because I don't define many new commands, but many authors' papers contain \def macros that cannot even be handled by de-macro. To actually get TeX to work, assuming I can obtain source files for most papers on arXiv at all, I would pretty much have to write my own variant of a TeX engine that expands all required macros and produces a plain-text document (see the second sketch after this list).
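As a concrete illustration of problem 1, here is a minimal extraction sketch using PyMuPDF (the file name is a placeholder). Extraction only recovers whatever Unicode mapping the PDF's embedded fonts declare, so on papers typeset with math fonts the output typically comes back with math symbols dropped, replaced by wrong characters, or mapped to private-use codepoints:

    import fitz  # PyMuPDF

    doc = fitz.open("paper.pdf")  # placeholder path to an arXiv PDF
    for page in doc:
        # get_text("text") returns the page's plain text in reading order;
        # math glyphs survive only if the fonts map them to sensible Unicode.
        print(page.get_text("text"))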
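And as an illustration of problem 3, here is a toy macro-expansion sketch. It handles only zero-argument \newcommand definitions with brace-free bodies; \def with delimited parameters, macros that take arguments, conditionals, and catcode changes all defeat this kind of textual approach, which is exactly why a real TeX-level expander would be needed:

    import re

    def expand_simple_macros(source: str) -> str:
        # Collect zero-argument \newcommand{\name}{body} definitions whose
        # bodies contain no braces, then strip the definitions themselves.
        defs = dict(re.findall(r'\\newcommand\{\\(\w+)\}\{([^{}]*)\}', source))
        body = re.sub(r'\\newcommand\{\\\w+\}\{[^{}]*\}', '', source)

        def substitute(match):
            return defs.get(match.group(1), match.group(0))

        # Iterate to a fixed point so macros defined in terms of other
        # simple macros get expanded too.
        previous = None
        while previous != body:
            previous = body
            body = re.sub(r'\\([A-Za-z]+)', substitute, body)
        return body

    print(expand_simple_macros(r'\newcommand{\eps}{\varepsilon}Let $\eps > 0$.'))
    # prints: Let $\varepsilon > 0$.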

Is there any other way to solve this problem? The target format I currently prefer is essentially plain text plus math symbols written in LaTeX, with no formatting other than what is semantically significant, such as \mathcal{A} and A being separate entities (see the example below). I could learn to set up a neural network and train it to recognize these printed math symbols, assuming my laptop is powerful enough. There are fewer than 200 symbols for the network to learn, and their shapes should be very easy to recognize because printed symbols show little variation. Shall I do that?
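For concreteness, an invented sentence in the intended canonical form might look like this: prose stays as plain text, and math is kept as bare LaTeX tokens, so that \mathcal{A} and A remain distinguishable:

    Let \mathcal{A} be a collection of subsets of X, and let A \in \mathcal{A}.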

  • This is an extremely broad request that requires extremely accurate results without any examples of what to work with in general. – Werner Nov 21 '18 at 01:51
  • @Werner Sure. My goal is to convert the text in a random paper such as https://arxiv.org/abs/1802.00001 to plain text while retaining all the semantically significant information. – Ying Zhou Nov 21 '18 at 18:40
  • "100% accuracy": That is impossible. Also, StackOverflow is not a way to avoid paying software engineers doing months/years of work or avoid proper requirements engineering. – Martin Thoma Jul 26 '23 at 20:24

1 Answer


Yes, you can try that: recognize the symbols, then transform them into LaTeX (for example, write \sqrt for every square root).
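A minimal sketch of what that could look like, assuming each detected glyph has been segmented into a 32x32 grayscale crop and that there are roughly 200 symbol classes (both numbers are assumptions), with an illustrative class-to-token table at the end:

    import torch
    import torch.nn as nn

    NUM_CLASSES = 200  # assumed size of the symbol inventory

    # A small convolutional classifier for printed math symbols.
    model = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                  # 32x32 -> 16x16
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                  # 16x16 -> 8x8
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 256), nn.ReLU(),
        nn.Linear(256, NUM_CLASSES),
    )

    # Illustrative mapping from predicted class index to a LaTeX token;
    # a real table would cover the full symbol inventory.
    LATEX_TOKENS = {0: r"\sqrt", 1: r"\int", 2: r"\mathcal{A}", 3: "A"}

    # Dummy forward pass on a batch of 4 crops to check the shapes
    # (the model is untrained here, so the predictions are arbitrary).
    crops = torch.randn(4, 1, 32, 32)
    predictions = model(crops).argmax(dim=1)
    print([LATEX_TOKENS.get(int(i), "?") for i in predictions])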

For the recognition problem itself, you can refer to this paper:

Torfinn Taxt, Jórunn B. Ólafsdóttir, Morten Dæhlen, "Recognition of handwritten symbols", Pattern Recognition, https://www.sciencedirect.com/science/article/abs/pii/003132039090113Y

http://neuralnetworksanddeeplearning.com/chap1.html - here you can learn more, with code samples, about implementing a neural network for handwritten digit recognition.