
I am converting HTML to docx using http://www.codeproject.com/Articles/91894/HTML-as-a-Source-for-a-DOCX-File.

Most characters are read properly, but some special characters such as •, “ and ” are displayed as •. What should I do to correct this?

The HTML that I pass to HTMLtoDocx was also not being read with its special characters intact; they showed up as '?'. After changing the encoding to Encoding.Default, it returns the correct characters. In HTMLtoDOCX there are two places where I can set the encoding (lines below). In both places I tried changing the encoding from Encoding.UTF8 to Encoding.Default, but it isn't helping.

StreamWriter streamStartPart = new StreamWriter(docpartDocumentXML.GetStream(FileMode.Create, FileAccess.Write), Encoding.Default);
byte[] Origem = Encoding.Default.GetBytes(html);
San

1 Answer


• indicates a UTF-8 byte sequence that was incorrectly interpreted as ANSI (= Encoding.Default).
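To illustrate the effect, here is a minimal sketch that encodes a bullet as UTF-8 and then decodes the same bytes as Windows-1252. It assumes that Encoding.Default on your machine is code page 1252, which is typical for Western European Windows installations but not guaranteed. The class and method names are made up for this example. The RegisterProvider call is only needed on .NET Core / .NET 5+; on the classic .NET Framework, code page 1252 is available without it.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as
    // Windows-1252. Assumption: 1252 stands in for Encoding.Default here.
    public static string MisreadUtf8AsAnsi(string text)
    {
        // Needed on .NET Core / .NET 5+ to make code page 1252 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(1252).GetString(utf8Bytes);
    }

    // Decodes UTF-8 bytes with the matching encoding, which is the correct way.
    public static string ReadUtf8AsUtf8(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string bullet = "\u2022"; // the bullet character

        // U+2022 is E2 80 A2 in UTF-8; read as 1252 those bytes become â, €, ¢.
        string garbled = MisreadUtf8AsAnsi(bullet);
        Console.WriteLine(garbled == "\u00E2\u20AC\u00A2");

        // With matching encodings the character survives unchanged.
        Console.WriteLine(ReadUtf8AsUtf8(bullet) == bullet);
    }
}
```

One character turning into the three characters â€¢ is the typical sign that bytes written as UTF-8 were decoded with a single-byte ANSI code page somewhere along the way.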

You should check whether the HTML file is read with the correct encoding.

While the encoding info is available in the HTTP Header or in HTML META tags, this encoding may not be correct if the HTML is read from a file.

Since .NET stores string characters as 2-byte UTF-16 code units, making sure the correct encoding is applied when reading and writing byte streams is the first step to fixing your problem.
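As a rough sketch of what "the same encoding on both sides" means, the following writes a string to a byte stream with an explicit encoding and reads it back with the same one. This is not HTMLtoDOCX code. The helper names WriteText and ReadText are hypothetical, and a MemoryStream stands in for the package part stream. UTF-8 without a byte order mark is my assumption for the example.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingRoundTrip
{
    // Writes the text to a byte array using the given encoding.
    public static byte[] WriteText(string text, Encoding encoding)
    {
        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, encoding))
        {
            writer.Write(text);
        } // disposing the writer flushes it and closes the stream

        // ToArray still works after the MemoryStream has been closed.
        return stream.ToArray();
    }

    // Reads the bytes back into a string using the given encoding.
    public static string ReadText(byte[] bytes, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Assumption for this sketch: UTF-8 without a byte order mark.
        Encoding utf8 = new UTF8Encoding(false);

        string original = "\u2022 \u201C \u201D"; // bullet and curly quotes
        byte[] bytes = WriteText(original, utf8);
        string roundTripped = ReadText(bytes, utf8);

        Console.WriteLine(roundTripped == original);
    }
}
```

If the writer and the reader disagree about the encoding at any hop (file to string, string to bytes, bytes into the docx part), the characters get damaged at that hop, so it is worth checking each one separately.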

devio
  • The encoding in the meta tags is set to UTF-8 (). I am reading the current page's HTML and processing it. In debug mode I have verified that the characters are displayed properly until the HTML is processed by HTMLtoDOCX. In HTMLtoDOCX I have changed back to byte[] Origem = Encoding.UTF8.GetBytes(html); – San Feb 21 '13 at 13:37