Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols such as a or 汉) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.
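The example above can be reproduced in a few lines. This is only a sketch in Python, chosen for illustration; `cp1252`, `koi8_r` and `utf-8` are the names Python's standard codecs use for these three encodings, and the variable names are made up for this example.

```python
# The same three bytes, interpreted under three different encodings.
data = bytes([0xE2, 0x89, 0xA0])

as_cp1252 = data.decode("cp1252")  # three characters; the last is a no-break space
as_koi8_r = data.decode("koi8_r")  # three characters
as_utf8 = data.decode("utf-8")     # a single character

for name, text in [("cp1252", as_cp1252), ("koi8_r", as_koi8_r), ("utf-8", as_utf8)]:
    # Print code points rather than glyphs so the output is unambiguous
    # regardless of what the terminal is able to display.
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{name:8} {len(text)} character(s): {code_points}")
```

The bytes never change; only the decoding step decides whether you get three characters or one.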

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

If the file you are looking at does not contain text, it does not encode any characters, and so character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error. The fix in this situation is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as many proprietary file formats, are binary.
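A rough illustration of the difference, assuming Python and a throwaway temporary file. The sample content is the eight-byte signature that starts every PNG image, standing in for any binary file.

```python
import os
import tempfile

# Hypothetical sample: the first eight bytes of a PNG image, which are not text.
png_signature = b"\x89PNG\r\n\x1a\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(png_signature)
    path = tmp.name

try:
    # Text mode tries to decode the bytes as UTF-8 and fails on the byte 0x89.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode performs no decoding and returns the bytes unchanged.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(text_mode_failed, raw)
```

Choosing a different text encoding does not help here; some legacy encodings will decode any byte sequence without complaint, but the result is still garbage.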

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. If you know the current encoding and the one you want to convert to, you can also try a tool like iconv or GNU recode. Bear in mind that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot represent text which contains Chinese or Russian characters, emoji, etc.
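Converting between encodings is always the same two steps: decode the bytes using the encoding they are actually in, then encode the resulting text into the encoding you want. A minimal sketch in Python, assuming the input is known to be code page 1252; the sample bytes and variable names are made up for illustration.

```python
# Hypothetical input: bytes that are known to be Windows code page 1252.
cp1252_bytes = b"Se\xf1or"

# Step 1: decode using the encoding the data is actually in.
text = cp1252_bytes.decode("cp1252")

# Step 2: encode into the encoding you want instead.
utf8_bytes = text.encode("utf-8")

# Not every encoding can hold every character: code page 1252 has no
# Chinese characters, so encoding U+6C49 (the character 汉) fails.
try:
    "\u6c49".encode("cp1252")
    chinese_fits = True
except UnicodeEncodeError:
    chinese_fits = False

print(utf8_bytes, chinese_fits)
```

If step 1 uses the wrong source encoding, the conversion usually still runs without error and silently produces the wrong text, which is why knowing the current encoding matters.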

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. Here is a hex dump of the beginning of the file:"

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
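If you are not sure which encoding your system treats as the default, one way to find out is to ask your programming environment. A minimal sketch in Python, assuming the standard `locale` module; the values reported depend entirely on the machine and its settings, so none are shown here.

```python
import locale
import sys

# The locale's preferred encoding. On Windows this is typically the
# code page that legacy documentation calls "ANSI".
preferred = locale.getpreferredencoding(False)

print("locale preferred encoding:", preferred)
print("stdout encoding:", sys.stdout.encoding)
```

Including this value in a question saves answerers from having to guess your locale.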

Notice:

We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.
A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).
If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.
A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
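If you do not have a hex dump tool at hand, a short script will do. The following is a sketch in Python; `hex_dump` is a hypothetical helper written for this example, not a standard library function, and its layout only loosely resembles the dump shown earlier.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset-prefixed rows of two-digit hex values."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        rows.append(f"{offset:06x} " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(rows)


# Hypothetical sample. For a real file, read the first few dozen bytes in
# binary mode instead, e.g. open(path, "rb").read(64), where path is your file.
sample = "Se\u00f1or \u2260".encode("utf-8")
print(hex_dump(sample))
# prints: 000000 53 65 c3 b1 6f 72 20 e2 89 a0
```

Reading the file in binary mode matters here: opening it in text mode would decode the bytes first and hide exactly the information the dump is meant to show.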

Common Questions

¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

How to support UTF-8 encoding in Eclipse

How can I add UTF-8 support in eclipse? I want to add for example Russian language but eclipse won't support it. What should I do? Please guide me.

eclipse encoding utf-8 character-encoding

asked Feb 07 '12 at 17:35

Katty (reputation 1,707; badges: 3 gold, 13 silver, 18 bronze)

142 votes, 5 answers

How to change the default charset of a MySQL table?

There is a MySQL table which has this definition taken from SQLYog Enterprise : Table Create Table ----------------- …

mysql character-encoding collation sqlyog

asked Jan 18 '12 at 07:48

pheromix (reputation 18,213; badges: 29 gold, 88 silver, 158 bronze)

140 votes, 2 answers

How many bits or bytes are there in a character?

How many bits or bytes are there per "character"?

character-encoding byte

asked Jan 31 '11 at 11:17

RedKing (reputation 1,563; badges: 4 gold, 12 silver, 10 bronze)

138 votes, 16 answers

Who sets response content-type in Spring MVC (@ResponseBody)

I'm having in my Annotation driven Spring MVC Java web application runned on jetty web server (currently in maven jetty plugin). I'm trying to do some AJAX support with one controller method returning just String help text. Resources are in UTF-8…

java web-applications spring-mvc character-encoding

asked Sep 01 '10 at 08:49

Hurda (reputation 4,647; badges: 8 gold, 35 silver, 49 bronze)

129 votes, 3 answers

.NET Core doesn't know about Windows 1252, how to fix?

This program works just fine when compiled for .NET 4 but does not when compiled for .NET Core. I understand the error about encoding not supported but not how to fix it. Public Class Program Public Shared Function Main(ByVal args As String())…

vb.net character-encoding .net-core windows-1252

asked Jun 16 '16 at 22:02

Joshua (reputation 40,822; badges: 8 gold, 72 silver, 132 bronze)

128 votes, 3 answers

How does UTF-8 "variable-width encoding" work?

The unicode standard has enough code-points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width…

unicode utf-8 character-encoding multibyte

asked Oct 09 '09 at 13:02

dsimard (reputation 4,245; badges: 5 gold, 22 silver, 16 bronze)

126 votes, 6 answers

How to set standard encoding in Visual Studio

I am searching for a way to setup Visual Studio so it always saves my files in UTF-8. I have only found options to set this project wide. Is there a way to set it Visual Studio wide?

visual-studio visual-studio-2008 encoding character-encoding

asked Mar 30 '09 at 09:55

Thomaschaaf (reputation 17,847; badges: 32 gold, 94 silver, 128 bronze)

125 votes, 5 answers

Meaning of - <?xml version="1.0" encoding="utf-8"?>

I am new to XML and I am trying to understand the basics. I read the line below in "Learning XML", but it is still not clear, for me. Can someone point me to a book or website which explains these basics clearly? From Learning XML: The XML…

xml character-encoding xml-declaration xml-encoding

asked Dec 06 '12 at 12:03

XML Boy (reputation 1,363; badges: 2 gold, 9 silver, 9 bronze)

122 votes, 11 answers

All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?

I'm creating a simple wordcount program in Java that reads through a directory's text-based files. However, I keep on getting the error: java.nio.charset.MalformedInputException: Input length = 1 from this line of code: BufferedReader reader =…

java character-encoding

asked Oct 08 '14 at 23:41

Jonathan Lam (reputation 16,831; badges: 17 gold, 68 silver, 94 bronze)

121 votes, 11 answers

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x91\xBD\xF0\x9F...'

I have the following string value: "walmart obama " I am using MySQL and Java. I am getting the following exception: `java.sql.SQLException: Incorrect string value: '\xF0\x9F\x91\xBD\xF0\x9F...' Here is the variable I am trying to insert into: var1…

java mysql encoding character-encoding sqlexception

asked Nov 30 '12 at 21:51

CodeKingPlusPlus (reputation 15,383; badges: 51 gold, 135 silver, 216 bronze)

119 votes, 3 answers

Is "&#160;" a replacement of "&nbsp;"?

In my ASP.NET application, I was trying to add few white spaces between two text boxes by typing space bar. The equivalent HTML source was &#160; instead of &nbsp;. So I just wanted to check: is this the new replacement for white space? If yes, any…

html asp.net visual-studio-2008 character-encoding

asked Jul 18 '10 at 04:29

Anto Varghese (reputation 3,131; badges: 6 gold, 31 silver, 38 bronze)

116 votes, 5 answers

What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C. Some interesting examples: The heart character. If I type this into my browser: http://www.google.com/search?q=♥ Then…

unicode utf-8 character-encoding urlencode web-standards

asked May 26 '09 at 21:18

Josh Gibson (reputation 21,808; badges: 28 gold, 67 silver, 63 bronze)

110 votes, 5 answers

Trouble with UTF-8 characters; what I see is not what I stored

I tried to use UTF-8 and ran into trouble. I have tried so many things; here are the results I have gotten: ???? instead of Asian characters. Even for European text, I got Se?or for Señor. Strange gibberish (Mojibake?) such as SeÃ±or or…

mysql unicode utf-8 character-encoding mariadb

asked Jul 14 '16 at 00:04

Rick James (reputation 135,179; badges: 13 gold, 127 silver, 222 bronze)

110 votes, 10 answers

Get a list of all the encodings Python can encode to

I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over? The reason I'm trying to do this is because a user has some text that is…

python unicode encoding character-encoding

asked Nov 13 '09 at 10:24

Amandasaurus (reputation 58,203; badges: 71 gold, 188 silver, 248 bronze)

109 votes, 10 answers

Reading a UTF8 CSV file with Python

I am trying to read a CSV file with accented characters with Python (only French and/or Spanish characters). Based on the Python 2.5 documentation for the csvreader (http://docs.python.org/library/csv.html), I came up with the following code to read…

python utf-8 csv character-encoding

asked May 24 '09 at 15:56

Martin (reputation 39,309; badges: 62 gold, 192 silver, 278 bronze)
