Special characters for Docx with ooxml

Question

I am converting HTML to docx using http://www.codeproject.com/Articles/91894/HTML-as-a-Source-for-a-DOCX-File.

Most characters are read properly, but some special characters such as •, “ and ” are displayed as sequences like â€¢. What should I do to correct this?

The HTML that I was passing to HTMLtoDocx also had its special characters read incorrectly: they were displayed as '?'. After changing the encoding to Encoding.Default, the correct characters are returned. In HTMLtoDOCX there are two places where I can set the encoding (the lines below). In both places I tried changing the encoding from Encoding.UTF8, as shown below, but it isn't helping.

StreamWriter streamStartPart = new StreamWriter(docpartDocumentXML.GetStream(FileMode.Create, FileAccess.Write), Encoding.Default);
byte[] Origem = Encoding.Default.GetBytes(html);

Is the HTML page UTF-8 encoded? Then you should use Encoding.UTF8.GetBytes(...) – el vis, Feb 21 '13 at 08:56
OK, have you then tried changing the StreamWriter constructor to use Encoding.UTF8? – el vis, Feb 21 '13 at 14:01
Adding Encoding.UTF8 to the StreamWriter constructor resolved the issue. Thanks. – San, Feb 21 '13 at 14:03

devio · Accepted Answer · 2013-02-21T08:51:25.283

â€¢ indicates a UTF-8 byte sequence incorrectly interpreted as ANSI (= Encoding.Default).
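
To see where â€¢ comes from, here is a minimal sketch that encodes a bullet as UTF-8 and then decodes the same bytes with an ANSI code page. It assumes .NET Core 3.0 or later (or .NET 5+), where code page 1252 has to be registered first, and it uses Windows-1252 as a stand-in for "ANSI"; on .NET Framework, Encoding.Default is whatever ANSI code page the system uses. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Windows-1252 stands in for "ANSI" here (an assumption; the system
    // ANSI code page on a given machine may be different).
    private static Encoding Ansi()
    {
        // Required on .NET Core / .NET 5+, where code page 1252 is not
        // available until the code pages provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Encode the text as UTF-8, then decode those bytes with the wrong encoding.
    public static string Garble(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Ansi().GetString(utf8Bytes);
    }

    // Reverse the damage: re-encode as ANSI, then decode as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Ansi().GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // The bullet U+2022 is the three bytes E2 80 A2 in UTF-8; read as
        // Windows-1252, those bytes are the three characters â, € and ¢.
        Console.WriteLine(Garble("\u2022"));
    }
}
```

Repair only works while every garbled character still maps back to its original byte, so it is a diagnostic aid rather than a fix; the real fix is to decode with the right encoding in the first place.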

You should check whether the HTML file is read with the correct encoding.

While the encoding info is available in the HTTP Header or in HTML META tags, this encoding may not be correct if the HTML is read from a file.

Since .NET treats string characters as 2-byte Unicode (UTF-16) values, making sure the correct encoding is applied when reading and writing byte streams is the first step to fixing your problem.
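
As a rough illustration of applying the same encoding on both sides of a byte stream, the sketch below writes a string through a StreamWriter with an explicit UTF-8 encoding and reads it back through a StreamReader using UTF-8 as well. The MemoryStream is only a stand-in for the real target (in the question, the stream returned by docpartDocumentXML.GetStream), and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8StreamDemo
{
    // Write the text as UTF-8 (without a byte order mark) and return the raw bytes.
    public static byte[] WriteUtf8(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
            {
                writer.Write(text);
            }
            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }

    // Decode the bytes with the same encoding that was used to write them.
    public static string ReadUtf8(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string html = "\u2022 \u201Cquoted\u201D"; // bullet and curly quotes
        byte[] bytes = WriteUtf8(html);
        Console.WriteLine(ReadUtf8(bytes) == html); // round trip preserves the text
    }
}
```

The point of the design is that the encoding is named explicitly at every boundary between strings and bytes, instead of relying on Encoding.Default, whose meaning depends on the machine and the runtime.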

edited Feb 21 '13 at 08:51

answered Feb 21 '13 at 08:44

The encoding in the meta tags is set to UTF-8. I am reading the current page's HTML and processing it. In debug mode I have verified that the characters are displayed properly until the HTML is processed by HTMLtoDOCX. In HTMLtoDOCX I have changed the line back to byte[] Origem = Encoding.UTF8.GetBytes(html); – San Feb 21 '13 at 13:37
