Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as letters, punctuation, or ideographs) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ (that is, â, ‰, and a non-breaking space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
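To see this concretely, here is a minimal Python 3 sketch that decodes those same three bytes under each of the encodings mentioned above (Python is just one convenient way to try this; any language that lets you decode raw bytes will do):

    # Decode the same three bytes under three different encodings.
    data = bytes([0xE2, 0x89, 0xA0])

    for encoding in ("cp1252", "koi8_r", "utf-8"):
        print(encoding, "->", data.decode(encoding))

Running this prints the Windows-1252, KOI8-R, and UTF-8 interpretations shown above.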

A useful reference is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, then it does not encode any characters, and character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error; the fix in this situation is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary files.
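For example, here is a minimal Python sketch of the difference (the file name photo.jpg is only a placeholder for whatever binary file you are dealing with):

    # Text mode decodes bytes to characters and can fail on binary data;
    # binary mode returns the raw bytes and never raises a decoding error.
    path = "photo.jpg"  # placeholder name

    try:
        with open(path, encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError as err:
        print("Not valid UTF-8 text:", err)

    with open(path, "rb") as f:
        raw = f.read()
    print("Read", len(raw), "raw bytes")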

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding; or, if you know both the current encoding and the encoding you want to convert to, try a tool like iconv or GNU recode. Understand that not all encodings can accommodate all characters (for example, Windows code page 1252 cannot save text which contains Chinese or Russian characters, emoji, etc.).
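If you would rather script the conversion, the following Python sketch does roughly what iconv -f CP1252 -t UTF-8 does (the file names and the two encodings are only examples; substitute your own):

    # Read text assuming Windows-1252, then write it back out as UTF-8.
    with open("input.txt", encoding="cp1252") as src:
        text = src.read()

    with open("output.txt", "w", encoding="utf-8") as dst:
        dst.write(text)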

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20"

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

On Western Windows installations you will commonly be using Windows code page 1252 (CP-1252); but of course, if you are only guessing, say so.
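If you are unsure which code page your system defaults to, you can ask it rather than writing "ANSI"; for example, this small Python sketch prints the locale's preferred encoding (typically cp1252 on a Western Windows installation):

    # Print the name of the locale's default encoding instead of calling it "ANSI".
    import locale

    print(locale.getpreferredencoding())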

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because it introduces several additional variables (we would also need to correctly guess how your web browser handled the text, and how the web server handled it, and the tool you used to obtain a copy, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice (one way to produce such a dump is sketched below).
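Here is a minimal Python 3 sketch that produces a short hex dump and then tries a handful of candidate encodings against the same bytes; the file name mystery.txt and the candidate list are assumptions to adapt, not a definitive recipe:

    # Hex-dump the first 16 bytes of a file and attempt a few decodings.
    # (Python 3.8+ for the separator argument to bytes.hex().)
    with open("mystery.txt", "rb") as f:
        sample = f.read(16)

    print(sample.hex(" "))  # e.g. 9e 9f 9a a0 ...

    for encoding in ("utf-8", "cp1252", "iso-8859-1", "koi8_r", "cp437"):
        try:
            print(encoding, "->", repr(sample.decode(encoding)))
        except UnicodeDecodeError:
            print(encoding, "-> not valid in this encoding")

A candidate that decodes cleanly is not necessarily the right one, of course; the decoded text still has to make sense in the expected language.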

Common Questions


¹ When talking about encodings, hex representations are often used because they are more concise than writing out the bits -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII, as ANSI X3.4-1986) and some text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism by which one of them is selected.
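As a quick check of the arithmetic in footnote 1, a two-line Python sketch:

    # 0xE2 written out in binary is 11100010, and back again.
    print(format(0xE2, "08b"))   # prints 11100010
    print(hex(0b11100010))       # prints 0xe2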

See Also

15132 questions

import utf-8 mysqldump to latin1 database (5 votes, 1 answer)
I have a dump file of a phpnuke site somehow in utf8. I'm trying to reopen this site on a new server. But nuke uses latin1. I need a way to create a latin1 database using this utf-8 dump file. I tried everything I could think of. iconv, mysql… (asked by hctopcu)

How do I write special characters (0x80..0x9F) to the Windows console? (5 votes, 2 answers)
I would like to have this code: System.Console.Out.WriteLine ("œil"); display œil instead of oil as it does in my test program. The Console.OutputEncoding is set by default to Western European (DOS) (CodePage set to 850 and WindowsCodePage set to… (asked by Pierre Arnaud)

Eclipselink characterEncoding equivalence (5 votes, 1 answer)
I have a characterEncoding problem in JPA - EclipseLink. My project's persistent.xml is ; (asked by Rahman Usta)

Charset not present in used JDK (5 votes, 1 answer)
I have a java system that serves as a gateway to different systems (java, mainframe, etc). This java system receives a request using, for example, utf8 and converts it to the encoding of the target. The issue is that there is a… (asked by fnmps)

Convert Between Latin1-encoded Data.ByteString and Data.Text (5 votes, 1 answer)
Since the latin-1 (aka ISO-8859-1) character set is embedded in the Unicode character set as its lowest 256 code-points, I'd expect the conversion to be trivial, but I didn't see any latin-1 encoding conversion functions in Data.Text.Encoding which… (asked by hvr)

C++ doesn't convert the uppercase "I" character to lowercase correctly (5 votes, 3 answers)
I have this simple C++ code that converts uppercase characters to lowercase: #include #include #include #include #include int main() { std::wstring input_str = L"İiIı"; std::locale… (asked by user2401856)

Unknown characters (5 votes, 1 answer)
I read the string from file with encoding "UTF-8". And I need to match it to a expression. The first character of the file is #, but in the string the first is ''(empty symbol). I have translated it into bytes with charset "UTF-8", here it is [-17,… (asked by itun)

jQuery ajax special characters problem (5 votes, 4 answers)
Okay, so here is the problem: I have a form on my php page. When a user has entered a name and presses submit, a jQuery click event (on the submit button) collects the information and passes it on through $.ajax(). $.ajax({ url:… (asked by Thor A. Pedersen)

How to find charset of System.err if stdout is redirected? (5 votes, 1 answer)
Finding the charset of System.out is tricky. (See Logback System.err output uses wrong encoding for discussion and implications with Logback.) Here's what the System.out API documentation says. The "standard" output stream. This stream is already… (asked by Garret Wilson)

Newline control characters in multi-byte character sets (5 votes, 4 answers)
I have some Perl code that translates new-lines and line-feeds to a normalized form. The input text is Japanese, so that there will be multi-byte characters. Is it still possible to do this transformation on a byte-by-byte basis (which I think it… (asked by Thilo)

Detecting non-ASCII characters in Rails (5 votes, 2 answers)
I am wondering if there's a way to detect non-ASCII characters in Rails. I have read that Rails does not use Unicode by default, and characters like Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these… (asked by gerky)

std::filesystem::path::u8string might not return valid UTF-8? (5 votes, 1 answer)
Consider this code, running on a Linux system (Compiler Explorer link): #include #include int main() { try { const char8_t bad_path[] = {0xf0, u8'a', 0}; // invalid utf-8, 0xf0 expects continuation bytes … (asked by user4520)

Solr vs document encoding problems (5 votes, 1 answer)
I am using solrj 1.4. My solrj doesn't index properly the documents in utf-16 encoding. I guess when it tries to convert to unicode, it replaces the problematic utf-16 surrogate keys with unicode replaceable character U+FFFD. Can anyone guide me on… (asked by user911084)

Problems with UTF-8 encoding, JSP, jQuery, Spring (5 votes, 2 answers)
I have a web app with Spring, JSP and jQuery in an Apache Tomcat 6; one JSP page has a form that sends the data with an ajax call made with jQuery, to a Spring MultiActionController on my back end. The problem is with the UTF-8 strings in the form… (asked by h0m3r16)

Firefox not displaying CP437 (5 votes, 3 answers)
I am developing an application with a Web interface, that is connecting up to an old Cobol mainframe, that uses CP437. We only have one system to communicate with, so if possible I would rather not do any charset conversions, and just use CP437… (asked by asc99c)