Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text/symbols such as a or or ) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes1 0xE2 0x89 0xA0 could represent the text ≠in Windows code page 1252, or Б┴═ in KOI8-R, or the character in UTF-8.

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, that means it does not encode any characters, and thus, character encoding is not meaningful or well-defined. A common beginner problem is trying to read a binary file as text and being surprised that you get a character encoding error. But the fix in this situation is to read the file in binary mode instead. For example, many office document, audio, video, and image formats, and proprietary file formats are binary files.

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save a file in a different encoding. Understand that not all encodings can accommodate all characters (so, for example, Windows code page 1252 cannot save text which contains Chinese or Russian characters, emoji, etc) or, if you know the current encoding and what you want to change it into, try a tool like iconv or GNU recode.

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this"?

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context2

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.

Common Questions


1 When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

2 The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

See Also

15132 questions
105
votes
5 answers

Python: Converting from ISO-8859-1/latin1 to UTF-8

I have this string that has been decoded from Quoted-printable to ISO-8859-1 with the email module. This gives me strings like "\xC4pple" which would correspond to "Äpple" (Apple in Swedish). However, I can't convert those strings to UTF-8. >>>…
Zyberzero
  • 1,604
  • 2
  • 15
  • 35
103
votes
8 answers

How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a “text” file without specifying the encoding. But you…
tchrist
  • 78,834
  • 30
  • 123
  • 180
102
votes
3 answers

Writing a connection string when password contains special characters

I'm using SQLalchemy for a Python project, and I want to have a tidy connection string to access my database. So for example: engine = create_engine('postgresql://user:pass@host/database') The problem is my password contains a sequence of special…
101
votes
7 answers

'git log' output encoding issues in Windows 10 CLI terminal

Problem How can I make git log command output properly displayed in the Windows CLI terminal? Example As you can see, I can type diacritical characters properly, but on git log, the output is somehow escaped. According to the UTF-8 encoding table,…
98
votes
3 answers

How to get a char from an ASCII Character Code in C#

I'm trying to parse a file in C# that has field (string) arrays separated by ASCII character codes 0, 1 and 2 (in Visual Basic 6 you can generate these by using Chr(0) or Chr(1) etc.) I know that for character code 0 in C# you can do the…
Jimbo
  • 22,379
  • 42
  • 117
  • 159
98
votes
6 answers

How to Find the Default Charset/Encoding in Java?

The obvious answer is to use Charset.defaultCharset() but we recently found out that this might not be the right answer. I was told that the result is different from real default charset used by java.io classes in several occasions. Looks like Java…
ZZ Coder
  • 74,484
  • 29
  • 137
  • 169
96
votes
4 answers

What is the maximum number of bytes for a UTF-8 encoded character?

What is the maximum number of bytes for a single UTF-8 encoded character? I'll be encrypting the bytes of a String encoded in UTF-8 and therefore need to be able to work out the maximum number of bytes for a UTF-8 encoded String. Could someone…
Edd
  • 8,402
  • 14
  • 47
  • 73
96
votes
20 answers

FPDF utf-8 encoding (HOW-TO)

Does anybody know how to set the encoding in FPDF package to UTF-8? Or at least to ISO-8859-7 (Greek) that supports Greek characters? Basically I want to create a PDF file containing Greek characters. Any suggestions would help. George
yorgos
  • 1,730
  • 5
  • 20
  • 28
96
votes
8 answers

C programming: How can I program for Unicode?

What prerequisites are needed to do strict Unicode programming? Does this imply that my code should not use char types anywhere and that functions need to be used that can deal with wint_t and wchar_t? And what is the role played by multibyte…
prinzdezibel
  • 11,029
  • 17
  • 55
  • 62
96
votes
5 answers

Difference between Url Encode and HTML encode

What’s the difference between an URL Encode and a HTML Encode?
Quintin Par
  • 15,862
  • 27
  • 93
  • 146
96
votes
3 answers

Java: Converting String to and from ByteBuffer and associated problems

I am using Java NIO for my socket connections, and my protocol is text based, so I need to be able to convert Strings to ByteBuffers before writing them to the SocketChannel, and convert the incoming ByteBuffers back to Strings. Currently, I am…
DivideByHero
  • 19,715
  • 24
  • 57
  • 64
95
votes
4 answers

How to remove non UTF-8 characters from text file

I have a bunch of Arabic, English, Russian files which are encoded in utf-8. Trying to process these files using a Perl script, I get this error: Malformed UTF-8 character (fatal) Manually checking the content of these files, I found some strange…
Hakim
  • 11,110
  • 14
  • 34
  • 37
93
votes
9 answers

HMAC-SHA256 Algorithm for signature calculation

I am trying to create a signature using the HMAC-SHA256 algorithm and this is my code. I am using US ASCII encoding. final Charset asciiCs = Charset.forName("US-ASCII"); final Mac sha256_HMAC = Mac.getInstance("HmacSHA256"); final SecretKeySpec…
Rishi
  • 1,331
  • 4
  • 13
  • 16
93
votes
14 answers

Save all files in Visual Studio project as UTF-8

I wonder if it's possible to save all files in a Visual Studio 2008 project into a specific character encoding. I got a solution with mixed encodings and I want to make them all the same (UTF-8 with signature). I know how to save single files, but…
jesperlind
  • 2,092
  • 2
  • 20
  • 22
91
votes
21 answers

PHP output showing little black diamonds with a question mark

I'm writing a php program that pulls from a database source. Some of the varchars have quotes that are displaying as black diamonds with a question mark in them (�, REPLACEMENT CHARACTER, I assume from Microsoft Word text). How can I use php to…
Steve Zajaczkowski