Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text/symbols such as a or 汉) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.
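As a minimal sketch of the example above, the following Python snippet decodes the same three bytes under each of the three encodings. The codec labels cp1252, koi8_r and utf-8 are the names Python uses for them; the variable names are only illustrative.

```python
# The three bytes from the example above.
raw = bytes([0xE2, 0x89, 0xA0])

# Decode the same bytes under three different encodings.
decoded = {name: raw.decode(name) for name in ("cp1252", "koi8_r", "utf-8")}

for name, text in decoded.items():
    # ascii() prints each character as an escape, so the output is safe on
    # any terminal and invisible characters show up explicitly.
    print(f"{name:8} {len(text)} character(s): {ascii(text)}")
```

Under cp1252 the third byte decodes to a no-break space (U+00A0), which is why only two characters are visible in that rendering.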

A useful reference is Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, and so character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error. The fix in this situation is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as many proprietary file formats, are binary.
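A minimal sketch of that situation in Python, assuming a temporary file that holds the eight-byte PNG signature as stand-in binary data (the helper name read_head is hypothetical): decoding the file as UTF-8 text fails, while binary mode returns the bytes untouched.

```python
import os
import tempfile

# First bytes of a PNG image, used here only as sample binary data.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def read_head(path: str, count: int = 8) -> bytes:
    """Return the first `count` bytes of a file without decoding them."""
    with open(path, "rb") as handle:  # "rb" = read, binary: no decoding
        return handle.read(count)

# Write the sample bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(PNG_SIGNATURE)
    sample_path = tmp.name

try:
    # Reading as text fails, because 0x89 cannot start a UTF-8 sequence.
    try:
        with open(sample_path, encoding="utf-8") as handle:
            handle.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Reading in binary mode simply returns the raw bytes.
    head = read_head(sample_path)
finally:
    os.remove(sample_path)

print(text_mode_failed, head)
```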

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Understand that not all encodings can accommodate all characters (for example, Windows code page 1252 cannot represent text which contains Chinese or Russian characters, emoji, etc.). If you know the current encoding and what you want to change it into, try a tool like iconv or GNU recode.
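If neither tool is at hand, the same conversion can be sketched in Python. This is only an illustration under stated assumptions: convert_encoding is a hypothetical helper, the file names are made up, and the sample text is "café" stored as ISO-8859-1. It does roughly what `iconv -f ISO-8859-1 -t UTF-8` does, and the last few lines show the limitation mentioned above.

```python
import os
import shutil
import tempfile

def convert_encoding(src_path, dst_path, src_encoding, dst_encoding):
    """Re-encode a text file from src_encoding to dst_encoding."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    # Raises UnicodeEncodeError if dst_encoding cannot represent a character.
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)

workdir = tempfile.mkdtemp()
try:
    src_path = os.path.join(workdir, "latin1.txt")  # hypothetical file names
    dst_path = os.path.join(workdir, "utf8.txt")

    with open(src_path, "wb") as handle:
        handle.write(b"caf\xe9\n")  # "café" encoded as ISO-8859-1

    convert_encoding(src_path, dst_path, "iso-8859-1", "utf-8")
    with open(dst_path, "rb") as handle:
        converted = handle.read()
finally:
    shutil.rmtree(workdir)

# Code page 1252 has no slot for a Chinese character such as U+6C49.
try:
    "\u6c49".encode("cp1252")
    cp1252_accepts_chinese = True
except UnicodeEncodeError:
    cp1252_accepts_chinese = False

print(converted, cp1252_accepts_chinese)
```

The conversion only works if the source encoding is stated correctly; if it is guessed wrong, the result is a cleanly re-encoded copy of the wrong characters.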

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file is shown below."

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
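If you are not sure which code page your system defaults to, one way to find out instead of guessing is to ask the locale. A minimal sketch in Python, using only the standard library; the exact name printed depends on your system and on how Python was started:

```python
import codecs
import locale

# The encoding preferred by the current locale settings.
preferred = locale.getpreferredencoding(False)

# The canonical codec name Python uses for it (for example "cp1252").
canonical = codecs.lookup(preferred).name

print(f"locale default encoding: {preferred} (codec: {canonical})")
```

Quoting this name in a question is far more useful than writing "ANSI".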

Notice:

We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.
A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).
If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.
A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
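If you have no hex dump tool available, a short Python function can produce one. This is a sketch, not a standard utility: hex_dump is a hypothetical helper, and the layout (six-digit offset, two groups of eight bytes) merely imitates the sample dump shown earlier.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of hex, `width` bytes per line."""
    half = width // 2
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        left = " ".join(f"{byte:02x}" for byte in chunk[:half])
        right = " ".join(f"{byte:02x}" for byte in chunk[half:])
        # Offset, first half, double space, second half.
        lines.append(f"{offset:06x} {left}  {right}".rstrip())
    return "\n".join(lines)

# The sixteen bytes from the sample dump earlier in this section.
sample = bytes.fromhex("9e 9f 9a a0 af b4 be f0 9e af b3 f2 20 b7 5f 20")
dump = hex_dump(sample)
print(dump)
```

To dump the start of a real file, read it in binary mode and pass the first few dozen bytes, for example `hex_dump(open(path, "rb").read(64))` where `path` is your file.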

Common Questions

¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

Is there a proper way to receive input from console in UTF-8 encoding?

When getting input from std::cin in windows, the input is apparently always in the encoding windows-1252 (the default for the host machine in my case) despite all the configurations made, that apparently only affect to the output. Is there a proper…

c++ windows visual-studio utf-8 character-encoding

asked Mar 14 '22 at 09:27

Raul Luna

1 answer

How to HTML encode or transliterate "high" characters in Excel?

In Excel, how can I convert the contents of a cell which includes accented characters, curly quotes etc into either HTML for the same characters, OR a transliterated plaintext version? We have an XLS document which contains some "high" characters.…

excel vba character-encoding export-to-csv

asked Aug 11 '11 at 02:23

Chris Burgess

2 answers

Node.js buffer encoding issue

I'm having trouble understanding character encoding in node.js. I'm transmitting data and for some reason the encoding causes certain characters to be replaced with other ones. What I'm doing is base 64 encoding at the client side and decoding it in…

javascript node.js character-encoding base64 buffer

asked Aug 02 '11 at 19:02

pimvdb

1 answer

Max length of base64 encoded salt/password for this algorithm

Below is the snippet of code that's used to hash passwords in an app I'm rewriting: internal string GenerateSalt() { byte[] buf = new byte[16]; (new RNGCryptoServiceProvider()).GetBytes(buf); return…

c# sql-server encoding character-encoding base64

asked Jul 29 '11 at 15:43

Andrey

1 answer

Need some clarification about LC_COLLATE and LC_CTYPE

I have gone through the official postgres documentation to know about the LC_COLLATE and LC_TYPE. But, still I don't understand it correctly. Can anyone help me in understanding these concepts and impact of these, especially when we are trying to…

postgresql oracle character-encoding character-set

asked Jul 21 '21 at 13:01

vigneshwar reddy

1 answer

java.lang.IllegalArgumentException: Illegal base64 character a

I have this string data to base64 decode String mfstr =…

java encoding character-encoding base64 urlencode

asked Jun 17 '21 at 12:09

hanan

1 answer

AWS Polly - Highlighting special characters

I am using the AWS Polly service for text to speech. But if the text contains some special characters, it is returning the wrong start and end numbers. For example if the text is : "Böylelikle" it returns : …

swift string amazon-web-services character-encoding amazon-polly

asked Jun 09 '21 at 08:44

sametbilgi

2 answers

ASP.NET - Invalid character in the given encoding .resx

I am adding a number of languages to a client's website using the App_LocalResource folder containing .resx files. The client's test application is hosted on a server with no outside Internet access so I have to remote desktop to the site and…

asp.net xml localization character-encoding

asked Jul 16 '11 at 13:05

TGuimond

2 answers

How can I traverse directories named in Japanese in Python?

I'm trying to build a simple helper utility that will look through my projects and find and return the open ones to me via command line. But my calls to os.listdir return gibberish (example: '\x82\xa9\x82\xcc\x96I') whenever the folder or filename…

python ios unicode character-encoding internationalization

asked Jul 14 '11 at 08:52

StormShadow

4 answers

Spring MVC response encoding issue

In last few hours I've read a lot concerning this topic, and so far nothing has worked. I'm trying to return response containing "odd" some characters. Here is example of that, quite simple : @ResponseBody @RequestMapping(value="test") …

java servlets spring-mvc character-encoding

asked Jul 09 '11 at 23:56

ant

4 answers

Japanese mojibake detection

I want to know if there is a way to detect mojibake (Invalid) characters by their byte range. (For a simple example, detecting valid ascii characters is just to see if their byte values are less 128) Given the old customized characters sets, such…

unicode character-encoding

asked Jun 30 '11 at 15:04

James John McGuire 'Jahmic'

1 answer

Importing CSS with @import in conjunction with @charset

If I use the following at the top of my CSS file home.css: @import "overall.css"; Do I need to (within home.css) redeclare @charset "utf-8"; or is this applied to the current CSS document when it's been defined within the included CSS?

css utf-8 import character-encoding

asked Jun 21 '11 at 08:28

Marty

1 answer

The length of a compressed Java String is not equal to the content-length when it is sent as a WebSocket message

I am trying to reduce bandwidth consumption by compressing the JSON String I am sending through the WebSocket from my Springboot application to the browser client (this is on top of permessage-deflate WebSocket extension). This scenario uses the…

javascript java string websocket character-encoding

asked Sep 18 '20 at 08:32

Gideon

4 answers

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with…

perl utf-8 character-encoding ascii

asked Mar 12 '09 at 10:40

Adrian Grigore

2 answers

help() with unicode author string

In the beginning of my scripts in Python 2.6, I would like to write my name as it is spelled, i.e. "Joël" (with trema on e). So I write __author__ = u'Joël', and I can retrieve it by a simple print __author__. Problem appears with the built-in…

python character-encoding author pydoc

asked Jun 16 '11 at 14:21

Joël

