Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes 0xE2 0x89 0xA0 [1] could represent the text â‰ (followed by a no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
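
As a small illustration of the point above, the following Python sketch decodes the same three bytes under each of the three encodings. The variable names are only for illustration; the codec names are the ones Python's standard library uses.

```python
# The same three bytes, interpreted under three different encodings.
raw = bytes([0xE2, 0x89, 0xA0])

as_cp1252 = raw.decode("cp1252")  # three characters: â, ‰ and a no-break space
as_koi8r = raw.decode("koi8_r")   # three characters: Б, ┴ and ═
as_utf8 = raw.decode("utf-8")     # one character: ≠ (U+2260)

for name, text in [("cp1252", as_cp1252), ("koi8_r", as_koi8r), ("utf-8", as_utf8)]:
    print(f"{name:8} {len(text)} character(s): {text!r}")
```

The bytes never change; only the decoder's interpretation of them does.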

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, and character encoding is not meaningful or well-defined for it. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary. A common beginner problem is trying to read such a binary file as text and being surprised by a character encoding error. The fix in this situation is to read the file in binary mode instead.
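
A minimal Python sketch of that beginner problem, assuming a file that starts with the PNG signature. The file name `example.png` is hypothetical and the file is created in a temporary directory just for the demonstration.

```python
import os
import tempfile

# The first eight bytes of a PNG image; they are not text in any encoding.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.png")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(PNG_SIGNATURE)

    # Text mode: Python tries to decode the bytes as UTF-8 and fails.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode: no decoding is attempted, so the bytes come back unchanged.
    with open(path, "rb") as f:
        data = f.read()

print("text mode:", text_mode_error)
print("binary mode:", data)
```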

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot store text which contains Chinese or Russian characters, emoji, etc. Alternatively, if you know the current encoding and the one you want to convert to, try a tool like iconv or GNU recode.
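
If neither an editor nor a command-line tool is at hand, the same kind of conversion can be sketched in a few lines of Python. This is only an illustration under the assumption that you already know the source encoding; `convert` is a hypothetical helper name, not part of any library.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding, then encode them in the target."""
    return data.decode(source).encode(target)

# "café" stored as Windows code page 1252, converted to UTF-8.
latin = "café".encode("cp1252")
converted = convert(latin, "cp1252", "utf-8")
print(latin, "->", converted)

# The reverse direction fails when the target encoding lacks the characters.
russian = "Привет".encode("utf-8")
try:
    convert(russian, "utf-8", "cp1252")
    conversion_failed = False
except UnicodeEncodeError:
    conversion_failed = True
print("Russian text fits in cp1252:", not conversion_failed)
```

The failure in the second half is the limitation described above: code page 1252 has no way to represent Russian letters.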

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. Here is a hex dump of the beginning of the file:"

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context [2]

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
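
To see why "ANSI" alone is ambiguous, here is a short Python sketch that decodes one and the same byte under two different Windows code pages. The choice of code pages 1252 and 1251 is just an assumed example of a Western European and a Russian locale.

```python
import locale

# What Python would pick by default on this machine; it differs between systems.
print("preferred encoding here:", locale.getpreferredencoding(False))

byte = b"\xe9"
western = byte.decode("cp1252")   # é under the Western European code page
cyrillic = byte.decode("cp1251")  # й under the Cyrillic code page
print(repr(western), repr(cyrillic))
```

Both results are "ANSI" in the Windows sense, which is why naming the precise code page matters.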

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
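
If you have no hex dump tool available, a few lines of Python can produce one. The sketch below is only one possible approach; `hex_dump` is a hypothetical helper, and its layout simply mimics the sample line shown earlier. In practice you would obtain the bytes by opening your file in binary mode and reading the first few dozen bytes.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of: offset, then hex values in two groups."""
    half = width // 2
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        left = " ".join(f"{b:02x}" for b in chunk[:half])
        right = " ".join(f"{b:02x}" for b in chunk[half:])
        lines.append(f"{offset:06x} {left}  {right}".rstrip())
    return "\n".join(lines)

# The sample bytes from the example question above.
sample = bytes.fromhex("9e 9f 9a a0 af b4 be f0 9e af b3 f2 20 b7 5f 20")
print(hex_dump(sample))
```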


[1] When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

[2] The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

15132 questions
5
votes
2 answers

PHP charset accents issue

I have a form in my page for users to leave a comment. I'm currently using this charset: meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1" but when retrieving the comment from the DB, accents are not displayed correctly (e.g. è =>è…
luca
5
votes
1 answer

Are there correct encodings for the backslash and tilde characters in Shift_JIS?

Or do these two characters simply not exist in Shift_JIS? The first 128 characters in the Shift_JIS character encoding scheme match ASCII except for two: 0x5C is a Yen symbol (¥) instead of a backslash, and 0x7E is an overline (‾) instead of a…
kshetline
5
votes
3 answers

MockMvc: changing default character encoding of MockHttpServletResponse from ISO-8859-1 to UTF-8

While writing Spring Integration Tests I had the problem that MockMvc ignored my .accept(MediaType.APPLICATION_JSON_UTF8) setting, and returned ISO-8859-1 with bad looking umlauts. What is the best way to set the default encoding of MockMvc to UTF-8?
R.A
5
votes
3 answers

How to read first n bytes of buffer and convert to string in NodeJS?

I have a string that was sent over a network and arrived on my server as a Buffer. It has been formatted to my own custom protocol (in theory, haven't implemented yet). I wanted to use the first n bytes for a string that will identify the protocol.…
5
votes
1 answer

Java XMLReader not clearing multi-byte UTF-8 encoded attributes

I've got a really strange situation where my SAX ContentHandler is being handed bad Attributes by XMLReader. The document being parsed is UTF-8 with multi-byte characters inside XML attributes. What appears to happen is that these attributes are…
mckamey
5
votes
4 answers

How to convert (char *) from ISO-8859-1 to UTF-8 in C++ multiplatformly?

I'm changing software in C++, which processes texts in ISO Latin 1 format, to store data in a database in SQLite. The problem is that SQLite works in UTF-8... and the Java modules that use the same database work in UTF-8. I wanted to have a way to…
gabriel
5
votes
1 answer

url encoded character gets parsed wrongly by webflow/EL/JSF

When I submit the character Ö from a webpage the backend receives Ã. The webpage is part of a Spring Webflow/JSF1.2/Facelets application. When I inspect the POST with firebug I see: Content-Type: application/x-www-form-urlencoded Content-Length: 74…
Nicolas Mommaerts
5
votes
2 answers

Maximum UTF-8 string size given UTF-16 size

What is the formula for determining the maximum number of UTF-8 bytes required to encode a given number of UTF-16 code units (i.e. the value of String.Length in C# / .NET)? I see 3 possibilities: # of UTF-16 code units x 2 # of UTF-16 code units x…
Mike Marynowski
5
votes
3 answers

How do browsers handle a <meta> tag that specifies the character-encoding?

Suppose a browser encounters a <meta> tag that specifies the character-encoding, like this: Does it start over from the beginning parsing the page again, since some of the…
Joel Lee
5
votes
2 answers

How can I get the charset of a string/buffer?

I need an elisp function that guesses the charset of some html, and since Emacs already does that when opening a file, I wonder if I can reuse it somehow, perhaps by writing the string in a temporary buffer, setting the correct charset, and getting…
konr
5
votes
5 answers

Where to put PHP encoding type header?

I am learning about handling UTF8 character sets and the recommendation is to explicitly set the encoding type in the output headers within your PHP script like so: header('Content-Type: text/html; charset=utf-8'); My question is about where I…
Sherri
5
votes
3 answers

Export csv with ISO-8859-1 encoding instead of UTF-8

I struggle with encoding in CSV exports. I'm from the Netherlands and we use quite a few tremas (e.g. ë, ï) and accents (e.g. é, ó) etc. This causes trouble when exporting to CSV and opening the file in Excel. On macOS Mojave. I've tried multiple encoding…
Tdebeus
5
votes
3 answers

java.util.Scanner to read files with different character encoding

I use Java to read a list of files. Some of these have a different encoding, ANSI instead of UTF-8. java.util.Scanner is unable to read these files and I get an empty output string. I tried another approach: FileInputStream fis = new…
plaidshirt
5
votes
4 answers

Representing 3 Integers Using One Byte?

I have three integers {a, b, c} that range (say) between the following values: a - {1 to 120, in jumps of 1} b - {-100 to 100, in jumps of 5} c - {1 to 10, in jumps of 1} Due to space considerations, I would like to represent these three values…
user3262424
5
votes
2 answers

How to detect text file encoding in objective-c?

I want to know the text file encoding in objective-c. Can you explain me how to know that?
Rizki