
I have the following HTML:

<html><body><p>n<sup>th</sup></p></body></html>

I am using the command:

$ libreoffice --convert-to docx:"MS Word 2007 XML" test.html

I run it to convert that HTML into a DOCX file. However, I notice that the resulting DOCX file does not mark the <sup> content as superscript. Instead, it appears to use position and size to imitate the effect of the <w:vertAlign> tag:

<w:position w:val="8"/><w:sz w:val="19"/>

What I need to know is how to make LibreOffice emit the <w:vertAlign> tag instead of using position and size.
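For anyone who wants to reproduce the check, here is a minimal sketch of how the run properties can be inspected. It relies only on the fact that a DOCX file is a zip archive whose main body lives in `word/document.xml`. The function name `superscript_markup` is made up for this example and is not part of LibreOffice or any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Run-property elements relevant to this question.
INTERESTING = ("vertAlign", "position", "sz")


def superscript_markup(docx_file):
    """Return the sorted names of vertAlign/position/sz elements found
    inside any <w:rPr> of the document body.

    `docx_file` may be a path or a file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    found = set()
    for rpr in root.iter(f"{{{W_NS}}}rPr"):
        for child in rpr:
            # Strip the "{namespace}" part to get the local element name.
            local_name = child.tag.split("}", 1)[-1]
            if local_name in INTERESTING:
                found.add(local_name)
    return sorted(found)
```

Calling `superscript_markup("test.docx")` on the converted file should list `position` and `sz` but not `vertAlign` if the conversion behaves as described above.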

Additional Info:

I had a similar problem with bold and italics (<strong> and <em>), but was able to get the conversion to work correctly by converting the strong and em tags to b and i tags respectively.
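The strong/em workaround can be applied as a preprocessing step before handing the file to LibreOffice. The following is only a sketch: `normalize_inline_tags` is a name invented here, and the regex approach assumes reasonably simple markup (a real HTML parser would be safer for arbitrary input).

```python
import re

# Map semantic tags to the presentational tags that converted correctly.
_TAG_MAP = {"strong": "b", "em": "i"}

# Matches opening and closing <strong>/<em> tags, with optional attributes.
# The tag name must be followed by whitespace or ">", so <embed> is not hit.
_TAG_RE = re.compile(r"<(/?)(strong|em)(\s[^>]*)?>", re.IGNORECASE)


def normalize_inline_tags(html):
    """Rewrite <strong> to <b> and <em> to <i>, keeping any attributes."""
    def _swap(match):
        slash = match.group(1)
        name = match.group(2).lower()
        attrs = match.group(3) or ""
        return f"<{slash}{_TAG_MAP[name]}{attrs}>"

    return _TAG_RE.sub(_swap, html)
```

The rewritten HTML would then be written to disk and passed to the same `libreoffice --convert-to` command as before.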

Jason Ward
  • I've had similar issues with LibreOffice's convert-to docx and spent too much time trying to figure out which tags converted correctly and which did not. I've had more consistent success using https://cloudconvert.org/html-to-docx. If you are in a crunch for time, I would suggest trying this alternative. Specifically, I know that it handles the `sup` tag properly. – Brian Gilreath May 27 '14 at 17:03
  • @BrianGilreath I tried the tool you mentioned with the exact HTML that was posted with my question. The `sup` was still converted to position and size instead of `vertAlign`. – Jason Ward May 27 '14 at 17:18
  • Could this be a doctype issue? If you declare the HTML5 doctype before the opening html element, do you get a different result? – albert Jun 01 '14 at 17:38
  • Do you need to convert it via LibreOffice? – user3241019 Jun 03 '14 at 12:35
  • @albert I have tried different doctypes; none of them seem to help here. @user3241019 I don't need to convert using LibreOffice, but it is the best tool I have found for the general case. – Jason Ward Jun 03 '14 at 14:22

4 Answers


If you are looking to edit the HTML, it would be much better to use a tool that is suited for editing HTML, such as Notepad++ or Sublime (as examples).

If you need to have the HTML as a LibreOffice document for a specific reason, you could open the HTML file in Notepad and save as a text file with .txt as the extension. That should allow you to open the document in LibreOffice.

Patricia Green
  • I am looking to give our users the ability to edit HTML, even though most of our users are not familiar with HTML. Most of our users are quite proficient with Microsoft Word, so it makes sense to convert the HTML to DOCX for editing in Word. I already have a tool to convert the DOCX file back to HTML. – Jason Ward Jun 16 '14 at 14:25
  • It's been a while since I was new to HTML, so just asking: is Notepad too much to ask them to learn? It's not meant to be a derogatory question, I'm merely curious. Learning the right tools is really the beginning of learning to program. There are also online tools that allow you to code completely within the browser, like http://scratchpad.io/ – Patricia Green Jun 18 '14 at 19:22
  • Our users have zero technical background. In order for them to use Notepad they would first need to learn how to structure HTML. Our tool gives them a way to edit their documents with a tool they are familiar with (Microsoft Word). We currently have a "What You Mean Editor", a JS tool for editing HTML in a Word-like fashion. However, it's clunky and a bit buggy. – Jason Ward Jun 19 '14 at 15:46

You can try using a WYSIWYG (What You See Is What You Get) editor like TinyMCE (http://www.tinymce.com/). There are lots of them online, and you can also find some desktop applications for that. But if you want to convert it to DOCX, you can try http://htmltodocx.codeplex.com/. It is written in PHP, uses PHPWord, and is quite efficient.

kk3nny
  • We already use a WYSIWYG editor, wymeditor (https://github.com/wymeditor/wymeditor). We are specifically trying to get around using it, since our customers are not as comfortable with the WYSIWYG editor as they are with Microsoft Word. I looked into `htmltodocx` briefly, but we don't use PHP, nor is it something we want to use. What I want is a way to tell LibreOffice what these tags are supposed to be and make the conversion work with LibreOffice, as my question asked. – Jason Ward Jul 28 '14 at 14:12

Just create a Python script that replaces your unwanted tags with the <w:vertAlign> tag wherever needed.
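A rough sketch of what such a script could look like follows. It is a heuristic, not a known-good fix: it assumes that every `<w:position>` with a positive value means superscript and a negative value means subscript, and that a `<w:sz>` written immediately after it is the shrunken size LibreOffice added and can be dropped. Runs that were deliberately raised or resized for other reasons would be misclassified. The function names are invented for this example.

```python
import re
import zipfile

# A <w:position> element, optionally followed directly by an explicit <w:sz>.
_POSITION_RE = re.compile(
    r'<w:position w:val="(-?\d+)"\s*/>(?:\s*<w:sz w:val="\d+"\s*/>)?'
)


def position_to_vertalign(document_xml):
    """Heuristically replace position/size pairs with <w:vertAlign>."""
    def _swap(match):
        offset = int(match.group(1))
        if offset == 0:
            # No vertical shift: leave the markup untouched.
            return match.group(0)
        kind = "superscript" if offset > 0 else "subscript"
        return f'<w:vertAlign w:val="{kind}"/>'

    return _POSITION_RE.sub(_swap, document_xml)


def rewrite_docx(src_path, dst_path):
    """Copy a DOCX archive, rewriting only word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                data = position_to_vertalign(text).encode("utf-8")
            dst.writestr(item, data)
```

For example, `rewrite_docx("test.docx", "test_fixed.docx")` would produce a copy with the rewritten run properties. The sketch does not reorder elements inside `<w:rPr>`, so a strict schema validator may still complain about where `<w:vertAlign>` ends up.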

Vivek
  • And how do you propose that I find the tags that are messed up and replace them with the `<w:vertAlign>` tag, considering that they will almost certainly look different depending on fonts, margins, etc.? – Jason Ward Aug 11 '14 at 17:54

The command works fine if you replace 'docx' with 'xml', like this:

libreoffice --convert-to xml:"MS Word 2003 XML" test.html
denim2x
  • Unfortunately, I need the resulting document to be a docx file as I am using PyDocx to convert the file back to HTML when the user is done editing the document. – Jason Ward Sep 28 '14 at 23:07
  • I believe MS Word can edit HTML docs directly (correct me if I'm wrong). – denim2x Sep 29 '14 at 12:57
  • Not very well. We do a lot of post-processing on the resulting HTML. Regardless, if I was not clear in the question that I want a DOCX file, please let me know so I can clear that up. – Jason Ward Sep 29 '14 at 18:55