Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text/symbols such as a or or ) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes1 0xE2 0x89 0xA0 could represent the text ≠in Windows code page 1252, or Б┴═ in KOI8-R, or the character in UTF-8.

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, that means it does not encode any characters, and thus, character encoding is not meaningful or well-defined. A common beginner problem is trying to read a binary file as text and being surprised that you get a character encoding error. But the fix in this situation is to read the file in binary mode instead. For example, many office document, audio, video, and image formats, and proprietary file formats are binary files.

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save a file in a different encoding. Understand that not all encodings can accommodate all characters (so, for example, Windows code page 1252 cannot save text which contains Chinese or Russian characters, emoji, etc) or, if you know the current encoding and what you want to change it into, try a tool like iconv or GNU recode.

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this"?

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context2

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.

Common Questions


1 When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

2 The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

See Also

15132 questions
192
votes
7 answers

How can I transform string to UTF-8 in C#?

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface. Due to incorrect encoding, a piece of my string looks like this in Spanish: Acción whereas it should…
Gaara
  • 2,117
  • 2
  • 15
  • 15
190
votes
10 answers

What's the difference between encoding and charset?

I am confused about the text encoding and charset. For many reasons, I have to learn non-Unicode, non-UTF8 stuff in my upcoming work. I find the word "charset" in email headers as in "ISO-2022-JP", but there's no such a encoding in text editors. (I…
TK.
  • 27,073
  • 20
  • 64
  • 72
185
votes
7 answers

What is the difference between encode/decode?

I've never been sure that I understand the difference between str/unicode decode and encode. I know that str().decode() is for when you have a string of bytes that you know has a certain character encoding, given that encoding name it will return a…
ʞɔıu
  • 47,148
  • 35
  • 106
  • 149
183
votes
4 answers

Why specify @charset "UTF-8"; in your CSS file?

I've been seeing this instruction as the very first line of numerous CSS files that have been turned over to me: @charset "UTF-8"; What does it do, and is this at-rule necessary? Also, if I include this meta tag in my "head" element, would that…
rsturim
  • 6,756
  • 15
  • 47
  • 59
174
votes
3 answers

Changing PowerShell's default output encoding to UTF-8

By default, when you redirect the output of a command to a file or pipe it into something else in PowerShell, the encoding is UTF-16, which isn't useful. I'm looking to change it to UTF-8. It can be done on a case-by-case basis by replacing the…
rwallace
  • 31,405
  • 40
  • 123
  • 242
172
votes
12 answers

PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded. The main problem for me is that I don't know what encoding the source of any string is going to be…
Grim...
  • 16,518
  • 7
  • 45
  • 61
170
votes
10 answers

Can I make git recognize a UTF-16 file as text?

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16. Can git be taught to recognize that this file…
skiphoppy
  • 97,646
  • 72
  • 174
  • 218
169
votes
23 answers

How do I remove  from the beginning of a file?

I have a CSS file that looks fine when I open it using gedit, but when it's read by PHP (to merge all the CSS files into one), this CSS has the following characters prepended to it:  PHP removes all whitespace, so a random  in the middle of…
Matt
  • 11,157
  • 26
  • 81
  • 110
161
votes
10 answers

How many characters can UTF-8 encode?

If UTF-8 is 8 bits, does it not mean that there can be only maximum of 256 different characters? The first 128 code points are the same as in ASCII. But it says UTF-8 can support up to million of characters? How does this work?
eMRe
  • 3,097
  • 5
  • 34
  • 51
158
votes
16 answers

Java : How to determine the correct charset encoding of a stream

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly What is the best way to programatically determine the correct charset encoding of an inputstream/file ? I have tried using the following: File in = …
Joel
  • 29,538
  • 35
  • 110
  • 138
151
votes
13 answers

How to change the default encoding to UTF-8 for Apache

I am using a hosting company and it will list the files in a directory if the file index.html is not there. It uses ISO 8859-1 as the default encoding. If the server is Apache, is there a way to set UTF-8 as the default instead? I found out that it…
nonopolarity
  • 146,324
  • 131
  • 460
  • 740
147
votes
10 answers

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
Ed Mays
  • 1,730
  • 4
  • 13
  • 12
147
votes
5 answers

JsonParseException : Illegal unquoted character ((CTRL-CHAR, code 10)

I'm trying to use org.apache.httpcomponents to consume a Rest API, which will post JSON format data to API. I get this exception: Caused by: com.fasterxml.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 10)): has to…
jian zhong
  • 1,471
  • 2
  • 9
  • 4
144
votes
14 answers

How to check if a String contains only ASCII?

The call Character.isLetter(c) returns true if the character is a letter. But is there a way to quickly find if a String only contains the base characters of ASCII?
TambourineMan
  • 1,443
  • 2
  • 9
  • 5
144
votes
7 answers

Is ASCII code in matter of fact 7 bit or 8 bit?

My teacher told me ASCII is an 8-bit character coding scheme. But it is defined only for 0-127 codes which means it can be fitted into 7 bits. So can't it be argued that ASCII is actually a 7-bit code? And what do we mean to say at all when saying…
Anurag Kalia
  • 4,668
  • 4
  • 21
  • 28