Convert LaTeX markup to HTML

Question

[UPDATED]

This is my task: converting a bunch of custom-built LaTeX files into InDesign. My current method is to run the .tex files through a PHP script that changes the custom LaTeX codes to more generic TeX codes, then use TeX2Word to convert them to .doc files, and then place those into InDesign.

What I want to do with this preg_replace is convert a few of the TeX tags into HTML-like tags so they won't be touched by TeX2Word. Then I'll be able to run a script in InDesign that changes the HTML-like tags to InDesign text frames, footnotes, variables and such.

[/UPDATED]

I have some text with LaTeX markup in it:

$newphrase = "\blockquote{\hspace*{.5em}Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Integer posuere erat a ante venenatis dapibus posuere
velit aliquet. Aenean lacinia bibendum nulla sed consectetur. Aenean
eu leo quam. Pellentesque ornare sem lacinia quam venenatis
vestibulum. Sed posuere consectetur est at lobortis. \note{Integer
posuere erat a ante venenatis dapibus posuere velit aliquet.
\textit{Vivamus} sagittis lacus vel augue laoreet rutrum faucibus
dolor auctor.}}";

What I want to do is remove \blockquote{...} and replace it with <div>...</div>

So I've tried a jillion different versions of this:

$regex = "#(blockquote){(.*)(})#";
$replace = "<div>$2</div>";
$newphrase = preg_replace($regex,$replace,$newphrase);

This is the output

\<div>\hspace*{.5em</div>Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Integer posuere erat a ante venenatis dapibus posuere
velit aliquet. Aenean lacinia bibendum nulla sed consectetur. Aenean
eu leo quam. Pellentesque ornare sem lacinia quam venenatis
vestibulum. Sed posuere consectetur est at lobortis. \note{Integer
posuere erat a ante venenatis dapibus posuere velit aliquet.
\textit{Vivamus} sagittis lacus vel augue laoreet rutrum faucibus
dolor auctor.}}

The first problem is that it replaces everything from \blockquote{ to the first }, when I want it to skip a } whenever another { has been opened after the initial \blockquote{.

The next problem I'm having is with the \: I can't seem to escape it! I've tried \\, /\\/, \\\, /\\\/, [\], [\\]. Nothing works! I'm sure it's because I don't understand how it really IS supposed to work.
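For reference, a literal backslash has to survive two layers of unescaping: first PHP's string parser, then the regex engine. The following is a minimal sketch, assuming single-quoted strings are acceptable; the variable names are illustrative and not taken from the post.

```php
<?php
// Single quotes keep \t and \n literal. In a double-quoted string PHP
// would turn the "\t" of "\textit" into a tab and the "\n" of "\note"
// into a newline before any regex runs.
$text = '\textit{Vivamus} sagittis';

// PCRE needs \\ to match one backslash. In a single-quoted PHP string,
// '\\\\' collapses to the two characters \\ before PCRE sees the pattern.
$pattern = '/\\\\textit\{([^{}]*)\}/';

$result = preg_replace($pattern, '<em>$1</em>', $text);
echo $result, "\n";
```

The same pattern written in a double-quoted string also needs four backslashes, but the subject string is where double quotes are most likely to cause trouble.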

So finally, this is what I want to end up with:

<div>\hspace*{.5em}Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Integer posuere erat a ante venenatis dapibus posuere
velit aliquet. Aenean lacinia bibendum nulla sed consectetur. Aenean
eu leo quam. Pellentesque ornare sem lacinia quam venenatis
vestibulum. Sed posuere consectetur est at lobortis. \note{Integer
posuere erat a ante venenatis dapibus posuere velit aliquet.
\textit{Vivamus} sagittis lacus vel augue laoreet rutrum faucibus
dolor auctor.}</div>

I'm planning to make $regex and $replace into arrays, so I can also replace things like \textit{Vivamus} with <em>Vivamus</em>.
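A small sketch of that array form, assuming the tags being replaced contain no nested braces. preg_replace pairs patterns and replacements by position; \textbf is a hypothetical second tag added only to show the pairing.

```php
<?php
// '\\\\' in a single-quoted PHP string reaches PCRE as \\,
// which matches one literal backslash.
$regex = [
    '/\\\\textit\{([^{}]*)\}/',   // \textit{...} with no nested braces
    '/\\\\textbf\{([^{}]*)\}/',   // \textbf is a hypothetical example tag
];
$replace = [
    '<em>$1</em>',
    '<strong>$1</strong>',
];

$newphrase = 'A \textit{Vivamus} and a \textbf{bold} word.';
$newphrase = preg_replace($regex, $replace, $newphrase);
echo $newphrase, "\n";
```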

Any guidance would be MUCH welcomed and appreciated!

Have you considered using a dedicated LaTeX to HTML converter? I'm sure that such things already exist and will save you from implementing your own regex-based LaTeX mangling (which will almost certainly be incomplete). – mu is too short, Apr 11 '12 at 04:04
I have looked into that; my problem is that none of the ones I've found allow custom markup tags. I have about 5,000 pages of LaTeX books that are full of custom LaTeX tags. :-( – Circle B, Apr 11 '12 at 14:26
The other thing is that all of my documents are "text only", with no formulas, and most of the converters I have found are focused on math formulas. – Circle B, Apr 11 '12 at 14:40

score 3 · Accepted Answer · answered Apr 16 '12 at 18:08

If you still want to do the conversion yourself, you can do it using multiple passes through the string, replacing the inner elements first:

$t = '\blockquote{\hspace*{.5em}Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Integer posuere erat a ante venenatis dapibus posuere
velit aliquet. Aenean lacinia bibendum nulla sed consectetur. Aenean
eu leo quam. Pellentesque ornare sem lacinia quam venenatis
vestibulum. Sed posuere consectetur est at lobortis. \note{Integer
posuere erat a ante venenatis dapibus posuere velit aliquet.
\textit{Vivamus} sagittis lacus vel augue laoreet rutrum faucibus
dolor auctor.}}';

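// Each callback receives the match array; $m[1] is the text between the braces.
function hspace($m) { return "<br />"; } // the spacing argument itself is discarded
function textit($m) { return "<i>" . $m[1] . "</i>"; }
function note($m) { return "<b>" . $m[1] . "</b>"; }
function blockquote($m) { return "<quote>" . $m[1] . "</quote>"; }

// [^{}] excludes braces, so a pattern only matches a command whose argument
// contains no remaining braces. Repeat until a pass changes nothing.
while (true) {
  $newt = $t;
  $newt = preg_replace_callback("/\\\\hspace\\*\\{([^{}]*?)\\}/", "hspace", $newt);
  $newt = preg_replace_callback("/\\\\textit\\{([^{}]*?)\\}/", "textit", $newt);
  $newt = preg_replace_callback("/\\\\note\\{([^{}]*?)\\}/", "note", $newt);
  $newt = preg_replace_callback("/\\\\blockquote\\{([^{}]*?)\\}/", "blockquote", $newt);

  if ($newt === $t) break; // nothing left to convert
  $t = $newt;
}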

echo $t;

Of course, this may work for simple examples, but you cannot use this method to correctly parse the whole TeX format. It also gets very inefficient for longer inputs, because every pass rescans the entire string.
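One possible alternative to the multi-pass loop is PCRE's recursive subpattern syntax, which lets a single pattern match a balanced {...} argument. This is a different technique from the loop in the answer, and only a sketch under assumptions: the function name convertCommands and the $tags map are hypothetical, commands not listed in the map are left untouched, and it has not been tried on large files, where PCRE's backtracking or recursion limits could apply.

```php
<?php
// Hypothetical map from LaTeX command name to HTML-like tag name.
$tags = ['blockquote' => 'div', 'textit' => 'em'];

function convertCommands(string $text, array $tags): string
{
    $names = implode('|', array_map('preg_quote', array_keys($tags)));
    // Group 1: command name. Group 2: one balanced {...} group; (?2) re-enters
    // group 2 for nested braces. Group 3: the text between the outer braces.
    $pattern = '/\\\\(' . $names . ')(\{((?:[^{}]++|(?2))*)\})/';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($tags): string {
            $tag = $tags[$m[1]];
            // Recurse so commands nested inside the argument are converted too.
            return "<$tag>" . convertCommands($m[3], $tags) . "</$tag>";
        },
        $text
    );
    if ($result === null) {
        // preg_replace_callback returns null when PCRE hits an error or limit.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $result;
}

$in = '\blockquote{\hspace*{.5em}Lorem \note{ipsum \textit{Vivamus} dolor.}}';
echo convertCommands($in, $tags), "\n";
// prints: <div>\hspace*{.5em}Lorem \note{ipsum <em>Vivamus</em> dolor.}</div>
```

Because \hspace and \note are not in the map, they pass through unchanged, which matches the goal of leaving some tags for TeX2Word. Checking for a null result matters here, since a silent failure on a large file would otherwise discard the text.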

That looks great! But like you said about long inputs, some of my files are incredibly large... – Circle B, Apr 17 '12 at 02:13
I'll go ahead and accept this because it really answers the question that I asked. Even though it's not exactly what I'm looking for, I'll probably use some of the concept, and it is an excellent answer. Thanks! @kuba – Circle B, Apr 19 '12 at 15:04

score 0 · Answer 2 · answered Apr 16 '12 at 08:38

As suggested above, you can use a dedicated LaTeX-to-HTML converter such as SimpleTex4ht.

Bastiaan Quast


That works fairly well; the problem is that I'm not looking to convert the entire document to HTML. – Circle B, Apr 16 '12 at 17:40
