How to do part-of-speech tagging of texts containing mathematical expressions?

Question

The goal is syntactic parsing of scientific texts, and the first step is part-of-speech tagging of their sentences. The texts come from arxiv.org, so they are originally in LaTeX. When text is extracted from the LaTeX documents, math expressions can be converted into MathML (or some other format, but I prefer MathML because this work is for a specific web app, and MathML is convenient for that).

The only idea I have is to substitute mathematical expressions with natural-language phrases and then run an existing POS-tagging algorithm. So the question is: how can these substitutions be implemented or, more generally, how can POS tagging be done on texts with mathematics in them?

Is it acceptable to remove all the formulae? If yes, all you need to do is add a rule to your tokenizer to remove math expressions or replace them with something like __formula__ — mbatchkarov, Mar 28 '13 at 21:17
I tried replacing math with a single word. The problem is that math expressions can play various syntactic roles: they can act as nouns, as numerals, or as whole phrases, so this approach produces many mistakes. — kseniyam, Mar 29 '13 at 06:22
Interesting, I have seen similar work with Twitter hashtags. Can you please post some example sentences? — mbatchkarov, Mar 29 '13 at 14:05
Some examples: 1. Application of contact symmetry to DEs was inaugurated by Lie himself, he proved that for ODEs of order n (n≥3), the contact symmetry algebra is finite-dimensional. 2. Using the required continuity at z=0 to find that A=B, we obtain the result... 3. In this article we study the hydrodynamic modes in a granular fluid with a distributed energy injection mechanism similar to the one in the Q2D geometry... If you are interested in the topic, you can take a look at the articles I've converted into HTML. https://docs.google.com/file/d/0By1jakHTY7LAazN4eFhlMG1oYzg/edit?usp=sharing — kseniyam, Mar 29 '13 at 16:51
They are not very carefully converted, but they give some impression of what the problem is. The link points to a zip archive containing the articles in HTML, plus a list.html with links to all articles in the archive for more convenient browsing. — kseniyam, Mar 29 '13 at 16:55
Interesting question. But could you add the examples you gave in the comment to the question itself (by _editing_ the question)? — jogojapan, Mar 30 '13 at 09:59

score 0 · Answer 1 · answered Apr 27 '13 at 01:24

Replacing all of the mathematical formulae with a single, unique word seems to be the way to go.
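
As a rough illustration of that idea, here is a minimal sketch that masks inline LaTeX math before the text is handed to a tagger. It assumes formulas are delimited by single dollar signs and does not handle escaped dollars or display math; the function name `mask_formulas` and the placeholder word `FORMULA` are made up for this example.

```python
import re

# Assumption: inline math is written as $...$ with no escaped dollar signs.
INLINE_MATH_RE = re.compile(r"\$[^$]+\$")


def mask_formulas(text, placeholder="FORMULA"):
    """Replace every inline formula with the same placeholder word."""
    return INLINE_MATH_RE.sub(placeholder, text)


sentence = r"for ODEs of order $n$ ($n \geq 3$), the algebra is finite-dimensional"
print(mask_formulas(sentence))
# for ODEs of order FORMULA (FORMULA), the algebra is finite-dimensional
```

The masked sentence can then be passed to any off-the-shelf tagger, which will treat the placeholder as an ordinary (probably unknown) word.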

abecadel

score 0 · Accepted Answer · answered Feb 13 '14 at 15:51

I have implemented a formula substitution algorithm on top of the Stanford tagger, and it works quite well. The way to go is, as abecadel has written, to replace every formula with a new, unique word. I used a combination of a word and a hash, e.g. 'formula-duwkziah'.
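
The following is not that implementation, only a minimal sketch of how such a substitution could look. It assumes formulas arrive as MathML `<math>...</math>` elements, and it derives the eight-letter suffix from a SHA-1 digest, which is my own choice. The helper names `placeholder_for`, `substitute_formulas` and `restore_formulas` are hypothetical, and the tagger itself is left out: any function that returns `(token, tag)` pairs would fit.

```python
import hashlib
import re

# Assumption: every formula is a MathML <math>...</math> element.
MATH_RE = re.compile(r"<math\b.*?</math>", re.DOTALL)


def placeholder_for(formula):
    """Build a word such as 'formula-abcdefgh' from a hash of the formula."""
    digest = hashlib.sha1(formula.encode("utf-8")).digest()
    # Map the first eight digest bytes onto lowercase letters.
    letters = "".join(chr(ord("a") + byte % 26) for byte in digest[:8])
    return "formula-" + letters


def substitute_formulas(text):
    """Return the text with formulas replaced, plus a placeholder-to-formula map."""
    mapping = {}

    def replace(match):
        word = placeholder_for(match.group(0))
        mapping[word] = match.group(0)
        return word

    return MATH_RE.sub(replace, text), mapping


def restore_formulas(tagged_tokens, mapping):
    """Put the original formulas back into a list of (token, tag) pairs."""
    return [(mapping.get(token, token), tag) for token, tag in tagged_tokens]
```

Usage would be roughly: call `substitute_formulas` on the sentence, run the tagger on the returned text, then pass the tagger output and the mapping to `restore_formulas`. One thing to check is that the tokenizer keeps the hyphenated placeholder as a single token; if it splits on hyphens, a placeholder without a hyphen is safer.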
