Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as the letter a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes [1] 0xE2 0x89 0xA0 could represent the text â‰ (followed by a non-breaking space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
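As a rough illustration of the point above, the following Python sketch decodes the same three bytes under each of the three encodings mentioned. The helper name `decode_with` is made up for this example; the codec names are the ones Python's standard library uses for these encodings.

```python
# The three bytes discussed above
raw = bytes([0xE2, 0x89, 0xA0])


def decode_with(data, encodings):
    """Decode the same bytes once per encoding name (hypothetical helper)."""
    return {name: data.decode(name) for name in encodings}


decoded = decode_with(raw, ["cp1252", "koi8_r", "utf-8"])

for name, text in decoded.items():
    # ascii() escapes non-ASCII characters, so the output is safe to print
    # even on a console that uses a limited code page.
    print(f"{name:8} -> {len(text)} character(s): {ascii(text)}")
```

The bytes never change; only the interpretation does. UTF-8 yields one character, while the two single-byte encodings yield three each.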

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

If the file you are looking at does not contain text, it does not encode any characters, so character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error; the fix is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as many proprietary file formats, are binary.
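A minimal Python sketch of this situation follows. It assumes a temporary file containing only the 8-byte PNG signature as a stand-in for a real binary file; the variable names are illustrative.

```python
import os
import tempfile

# Stand-in for a binary file: the 8-byte PNG file signature
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as tmp:
    tmp.write(png_signature)
    path = tmp.name

# Reading as text forces a decode, which fails because 0x89 cannot
# start a UTF-8 sequence.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Reading in binary mode returns the raw bytes with no decoding step.
with open(path, "rb") as f:
    data = f.read()

os.remove(path)
print("text mode:", text_mode_error)
print("binary mode:", data)
```

Picking a different text encoding would not be a fix here: some single-byte encodings would decode without an error, but the result would still be meaningless.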

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Alternatively, if you know the current encoding and the one you want to convert to, try a tool like iconv or GNU recode. Keep in mind that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot represent text which contains Chinese or Russian characters, emoji, etc.
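If neither an editor nor a command-line tool is at hand, the same conversion can be sketched in a few lines of Python: decode from the known source encoding, then encode in the target. The function name `transcode` and the sample strings are assumptions made for this illustration.

```python
def transcode(data, source, target):
    """Decode bytes from the source encoding, then encode in the target."""
    return data.decode(source).encode(target)


# "café" stored as Windows code page 1252 (é is the single byte 0xE9)
cp1252_bytes = "caf\u00e9".encode("cp1252")
utf8_bytes = transcode(cp1252_bytes, "cp1252", "utf-8")

# Not every encoding can hold every character: Cyrillic text
# ("Привет", written with escapes here) has no representation in cp1252.
try:
    "\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1252")
    cyrillic_fits = True
except UnicodeEncodeError:
    cyrillic_fits = False

print(cp1252_bytes, "->", utf8_bytes)
print("Cyrillic fits in cp1252:", cyrillic_fits)
```

On the command line, `iconv -f CP1252 -t UTF-8` performs roughly the same conversion.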

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20"

Bad: Anything which tries to use the term "ANSI" in this context [2]

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
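Rather than guessing, you can ask your system which default it actually uses. The following Python sketch is one way to do that; the result depends on the machine and its locale settings (it might report something like cp1252 on a Western Windows installation), so no particular output is assumed.

```python
import codecs
import locale

# The encoding the current locale prefers for text data. Passing False
# avoids changing the process locale as a side effect.
default_name = locale.getpreferredencoding(False)

# Normalise aliases to the canonical codec name Python uses internally.
canonical_name = codecs.lookup(default_name).name

print(f"Locale default encoding: {default_name} (codec: {canonical_name})")
```

Quoting this name in a question is far more useful than "ANSI", since it identifies a specific code page.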

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
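A short hex dump of the kind asked for above can be produced with the usual command-line tools, or with a few lines of Python. The sketch below is one possible formatter; `hex_dump` is a name made up for this example, and the sample bytes are the ones from the "Good" question above.

```python
def hex_dump(data, width=16):
    """Format bytes as offset-prefixed rows of two-digit hex values."""
    half = width // 2
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Two groups per row, separated by an extra space for readability
        left = " ".join(f"{b:02x}" for b in chunk[:half])
        right = " ".join(f"{b:02x}" for b in chunk[half:])
        lines.append(f"{offset:06x} {left}  {right}".rstrip())
    return "\n".join(lines)


# With a real file, read only the first few rows' worth of bytes, e.g.
#   with open(path, "rb") as f: sample = f.read(48)
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

Because the file is opened in binary mode, no encoding is involved in producing the dump, which is what makes it unambiguous.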

Common Questions


[1] When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

[2] The American National Standards Institute has standardized some character sets (notably ASCII, as ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

See Also

15132 questions
363
votes
19 answers

Change MySQL default character set to UTF-8 in my.cnf?

Currently we are using the following commands in PHP to set the character set to UTF-8 in our application. Since this is a bit of overhead, we'd like to set this as the default setting in MySQL. Can we do this in /etc/my.cnf or in another…
Jorre
  • 17,273
  • 32
  • 100
  • 145
332
votes
18 answers

Is there an upside down caret character?

I have to maintain a large number of classic ASP pages, many of which have tabular data with no sort capabilities at all. Whatever order the original developer used in the database query is what you're stuck with. I want to tack on some basic…
Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794
331
votes
26 answers

Detect encoding and make everything UTF-8

I'm reading out lots of texts from various RSS feeds and inserting them into my database. Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1. Unfortunately, there are sometimes problems with the…
caw
  • 30,999
  • 61
  • 181
  • 291
298
votes
8 answers

What encoding/code page is cmd.exe using?

When I open cmd.exe on Windows, what encoding is it using? How can I check which encoding it is currently using? Does it depend on my regional setting or are there any environment variables to check? What happens when you type a file with a certain…
Dan Gøran Lunde
  • 5,148
  • 3
  • 26
  • 24
295
votes
10 answers

What is a vertical tab?

What was the original historical use of the vertical tab character (\v in the C language, ASCII 11)? Did it ever have a key on a keyboard? How did someone generate it? Is there any language or system still in use today where the vertical tab…
dmazzoni
  • 12,866
  • 4
  • 38
  • 34
293
votes
13 answers

How to convert Strings to and from UTF8 byte arrays in Java

In Java, I have a String and I want to encode it as a byte array (in UTF8, or some other encoding). Alternately, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?
mcherm
  • 23,999
  • 10
  • 44
  • 50
273
votes
10 answers

What is ANSI format?

What is ANSI encoding format? Is it a system default format? In what way does it differ from ASCII?
web dunia
  • 9,381
  • 18
  • 52
  • 64
268
votes
19 answers

How do you echo a 4-digit Unicode character in Bash?

I'd like to add the Unicode skull and crossbones to my shell prompt (specifically the 'SKULL AND CROSSBONES' (U+2620)), but I can't figure out the magic incantation to make echo spit it, or any other, 4-digit Unicode character. Two-digit one's are…
masukomi
  • 10,313
  • 10
  • 40
  • 49
260
votes
11 answers

PHP DOMDocument loadHTML not encoding UTF-8 correctly

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me). $profile = "various japanese characters"; $dom = new DOMDocument(); $dom->loadHTML($profile);…
Slightly A.
  • 2,795
  • 2
  • 16
  • 10
250
votes
8 answers

Writing Unicode text to a text file?

I'm pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a Wordpress page). It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source? Currently…
simon
  • 5,987
  • 13
  • 31
  • 28
247
votes
8 answers

Write to UTF-8 file in Python

I'm really confused with the codecs.open function. When I do: file = codecs.open("temp", "w", "utf-8") file.write(codecs.BOM_UTF8) file.close() It gives me the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal…
John Jiang
  • 11,069
  • 12
  • 51
  • 60
237
votes
15 answers

Do I really need to encode '&' as '&amp;'?

I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles. http://validator.w3.org is giving me this: & did not start a character reference. (& probably…
Haroldo
  • 36,607
  • 46
  • 127
  • 169
214
votes
6 answers

Why charset names are not constants?

Charset issues are confusing and complicated by themselves, but on top of that you have to remember exact names of your charsets. Is it "utf8"? Or "utf-8"? Or maybe "UTF-8"? When searching internet for code samples you will see all of the above. Why…
serg
  • 109,619
  • 77
  • 317
  • 330
204
votes
3 answers

Change the encoding of a file in Visual Studio Code

Is there any way to change the encoding of a file? For example UTF-8 to ISO 8859-1? Setting Example Sublime Text: "default_encoding": "UTF-8"
Fernando Tholl
  • 2,587
  • 2
  • 16
  • 14
200
votes
12 answers

Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode. html = urllib.urlopen(link).read() html.encode("utf8","ignore") self.response.out.write(html) But I get a UnicodeDecodeError: Traceback (most recent call last): File…
themirror
  • 9,963
  • 7
  • 46
  • 79