
I have been trying to convert the HTML string question_text_html (a mathematical question written in HTML) in the code below to a LaTeX string using pypandoc, but the converted string keeps including irrelevant strings such as "\protect\hypertarget{MJX-...}.....".

import pypandoc
from selenium import webdriver
from selenium.webdriver.common.by import By

# The original snippet used `driver` without creating it; Firefox is an assumption.
driver = webdriver.Firefox()
driver.get("https://nigerianscholars.com/past-questions/mathematics/?show_answers=yes")

question_blocks = driver.find_elements(By.CLASS_NAME, 'question_block')
for question_block in question_blocks:
    question_text = question_block.find_element(By.CLASS_NAME, 'question_text')
    # innerHTML is the MathJax-rendered markup, including every generated span.
    question_text_html = question_text.get_attribute('innerHTML')
    question_latex = pypandoc.convert_text(question_text_html, 'tex', format='html')
    print(f'Question Html is {question_text_html}')
    print(f'Question latex is {question_latex}')
 

It usually gives:

 Question Html is <html><body><p class="q_question">Differentiate <span class="MathJax_Preview" style="color: inherit;"></span><span class="mjx-chtml MathJax_CHTML" data-mathml='&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mn&gt;5&lt;/mn&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;&amp;#x2212;&lt;/mo&gt;&lt;mn&gt;4&lt;/mn&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;/math&gt;' id="MathJax-Element-1-Frame" role="presentation" style="font-size: 114%; position: relative;" tabindex="0"><span aria-hidden="true" class="mjx-math" id="MJXc-Node-1"><span class="mjx-mrow" id="MJXc-Node-2"><span class="mjx-mo" id="MJXc-Node-3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mn" id="MJXc-Node-4"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span><span class="mjx-mi" id="MJXc-Node-5"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-6"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">+</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-7"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">5</span></span><span class="mjx-msubsup" id="MJXc-Node-8"><span class="mjx-base"><span class="mjx-mo" id="MJXc-Node-9"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span><span class="mjx-sup" style="font-size: 70.7%; vertical-align: 0.513em; padding-left: 0px; padding-right: 0.071em;"><span class="mjx-mn" id="MJXc-Node-10" style=""><span class="mjx-char 
MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span></span></span><span class="mjx-mo" id="MJXc-Node-11"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mi" id="MJXc-Node-12"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-13"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">−</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-14"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">4</span></span><span class="mjx-mo" id="MJXc-Node-15"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML"><mo stretchy="false">(</mo><mn>2</mn><mi>x</mi><mo>+</mo><mn>5</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo stretchy="false">(</mo><mi>x</mi><mo>−</mo><mn>4</mn><mo stretchy="false">)</mo></math></span></span><script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script> with respect to x.</p></body></html>






Question latex is Differentiate
{}\protect\hypertarget{MathJax-Element-1-Frame}{}{\protect\hypertarget{MJXc-Node-1}{}{\protect\hypertarget{MJXc-Node-2}{}{\protect\hypertarget{MJXc-Node-3}{}{{(}}\protect\hypertarget{MJXc-Node-4}{}{{2}}\protect\hypertarget{MJXc-Node-5}{}{{x}}\protect\hypertarget{MJXc-Node-6}{}{{+}}\protect\hypertarget{MJXc-Node-7}{}{{5}}\protect\hypertarget{MJXc-Node-8}{}{{\protect\hypertarget{MJXc-Node-9}{}{{)}}}{\protect\hypertarget{MJXc-Node-10}{}{{2}}}}\protect\hypertarget{MJXc-Node-11}{}{{(}}\protect\hypertarget{MJXc-Node-12}{}{{x}}\protect\hypertarget{MJXc-Node-13}{}{{−}}\protect\hypertarget{MJXc-Node-14}{}{{4}}\protect\hypertarget{MJXc-Node-15}{}{{)}}}}{\((2x + 5)^{2}(x - 4)\)}}\((2x+5)^2(x-4)\)
with respect to x.

How can I remove all the "\protect\hypertarget{MJXc-Node-10}" strings from the LaTeX, leaving only

Differentiate {\((2x + 5)^{2}(x - 4)\)}}\((2x+5)^2(x-4)\)
with respect to x.
msughter
  • Could you boil this down to a simpler example? I'm not going to debug a program containing irrelevant details, or to endlessly scroll through vertical text to find out what's going on. But I'll be happy to help if it's clear what the question is. – tarleb Jan 08 '21 at 07:40
  • Sorry, I have edited the post. – msughter Jan 08 '21 at 09:07
  • Now there seems to be something missing in the HTML output. I assume it has a `<span>` for every element in the equation (probably with identifier "MJXc-Node-*"), and those spans are converted to `\hypertarget` in LaTeX. You may want to use a shorter equation and post the full HTML/MathJax. – hlg Jan 08 '21 at 09:36
  • I have posted the full HTML for the question. It's a bit much, but that was the shortest question I could find. – msughter Jan 08 '21 at 10:11
  • I also tried to remove all the span elements in the equation, but the converter returns an empty LaTeX string ....{} – msughter Jan 08 '21 at 10:14
  • Further edits which would improve the question: (a) remove all code after the print statements as it seems irrelevant (b) remove code that's been commented out (c) post the actual code (the above would throw an error as `soup` is undefined). – tarleb Jan 08 '21 at 16:34
  • OK, I have edited it, and I think running it now would give you the same result I got. – msughter Jan 08 '21 at 17:54

1 Answer


With MathJax, the equation is initially present in the page in TeX notation. The spans are created by MathJax's JavaScript to lay out the equation in HTML. Currently, you let MathJax render the equation first, grab the rendered equation, and then try to convert it back to the original TeX. It would be more straightforward to read the TeX equation directly, without the indirection of JavaScript rendering.

To achieve that, you would just need to disable JavaScript in Selenium. For example, with the Firefox driver this should do the trick:

from selenium.webdriver.firefox.options import Options
from selenium import webdriver

opts = Options()
opts.preferences.update({
    "javascript.enabled": False,
})
driver = webdriver.Firefox(options=opts)

Alternatively, if you need to process the rendered version with JavaScript enabled for some reason, you could try to get hold of the content of the script element inside the <p>. It contains the full equation, but without the TeX math delimiters:

<p class="q_question">...<script type="math/tex">(2x+5)^2(x-4)</script>...</p>

This way you would not have to remove the spans. You would then need to enclose it in TeX math markup \(...\) for the PDF.
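The second approach could be sketched roughly as follows, using only Python's standard `html.parser`. This is a minimal sketch, not a tested solution for the whole site: the names `MathJaxTexExtractor` and `mathjax_html_to_tex` are made up for illustration, the class names in `RENDERED_CLASSES` are taken from the sample HTML in the question, and the result is plain text with `\(...\)` inserted, so any other HTML formatting in the question is dropped.

```python
from html.parser import HTMLParser

# Classes seen on MathJax's rendered markup in the sample HTML from the question.
# Assumption: other MathJax versions or output modes may use different names.
RENDERED_CLASSES = {"MathJax_Preview", "mjx-chtml", "MJX_Assistive_MathML"}


class MathJaxTexExtractor(HTMLParser):
    """Collect plain text, replacing rendered MathJax with the original TeX."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []          # pieces of the output text
        self.skip_depth = 0      # > 0 while inside a rendered MathJax span
        self.script_kind = None  # "inline", "display", "other", or None
        self.script_buf = []     # contents of the current <script>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            mime = attrs.get("type") or ""
            if mime.startswith("math/tex"):
                self.script_kind = "display" if "mode=display" in mime else "inline"
            else:
                self.script_kind = "other"  # ordinary scripts are dropped
            self.script_buf = []
        elif tag == "span":
            classes = set((attrs.get("class") or "").split())
            # Track nesting so the matching </span> ends the skipped region.
            if self.skip_depth or classes & RENDERED_CLASSES:
                self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "script":
            tex = "".join(self.script_buf).strip()
            if self.script_kind == "inline":
                self.parts.append("\\(" + tex + "\\)")
            elif self.script_kind == "display":
                self.parts.append("\\[" + tex + "\\]")
            self.script_kind = None
        elif tag == "span" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.script_kind is not None:
            self.script_buf.append(data)
        elif not self.skip_depth:
            self.parts.append(data)


def mathjax_html_to_tex(html):
    """Return the text of `html` with MathJax equations as TeX math."""
    parser = MathJaxTexExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace (this also affects whitespace inside the TeX).
    return " ".join("".join(parser.parts).split())


sample = (
    '<p class="q_question">Differentiate '
    '<span class="MathJax_Preview"></span>'
    '<span class="mjx-chtml MathJax_CHTML"><span class="mjx-math">'
    '<span class="mjx-mn">2</span></span>'
    '<span class="MJX_Assistive_MathML"><math><mn>2</mn></math></span></span>'
    '<script type="math/tex">(2x+5)^2(x-4)</script> with respect to x.</p>'
)
print(mathjax_html_to_tex(sample))
```

In the loop from the question, `mathjax_html_to_tex(question_text_html)` would take the place of the `pypandoc.convert_text` call. If the rest of the markup matters, the same idea (drop the rendered spans, keep the script content) could be applied to the HTML before handing it to pandoc instead.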

hlg