Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as letters, punctuation, or ideographs) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ (that is, â, ‰, and a non-breaking space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
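To see this concretely, here is a minimal Python 3 sketch that decodes those same three bytes under each of the encodings mentioned above (Python is just one convenient way to try this; any language that lets you decode raw bytes will do):

    # Decode the same three bytes under three different encodings.
    data = bytes([0xE2, 0x89, 0xA0])

    for encoding in ("cp1252", "koi8_r", "utf-8"):
        print(encoding, "->", data.decode(encoding))

Running this prints the Windows-1252, KOI8-R, and UTF-8 interpretations shown above.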

A useful reference is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, then it does not encode any characters, and character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error; the fix in this situation is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary files.
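For example, here is a minimal Python sketch of the difference (the file name photo.jpg is only a placeholder for whatever binary file you are dealing with):

    # Text mode decodes bytes to characters and can fail on binary data;
    # binary mode returns the raw bytes and never raises a decoding error.
    path = "photo.jpg"  # placeholder name

    try:
        with open(path, encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError as err:
        print("Not valid UTF-8 text:", err)

    with open(path, "rb") as f:
        raw = f.read()
    print("Read", len(raw), "raw bytes")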

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding; or, if you know both the current encoding and the encoding you want to convert to, try a tool like iconv or GNU recode. Understand that not all encodings can accommodate all characters (for example, Windows code page 1252 cannot save text which contains Chinese or Russian characters, emoji, etc.).
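If you would rather script the conversion, the following Python sketch does roughly what iconv -f CP1252 -t UTF-8 does (the file names and the two encodings are only examples; substitute your own):

    # Read text assuming Windows-1252, then write it back out as UTF-8.
    with open("input.txt", encoding="cp1252") as src:
        text = src.read()

    with open("output.txt", "w", encoding="utf-8") as dst:
        dst.write(text)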

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20"

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

On Western Windows installations you will commonly be using Windows code page 1252 (CP-1252); but of course, if you are only guessing, say so.
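If you are unsure which code page your system defaults to, you can ask it rather than writing "ANSI"; for example, this small Python sketch prints the locale's preferred encoding (typically cp1252 on a Western Windows installation):

    # Print the name of the locale's default encoding instead of calling it "ANSI".
    import locale

    print(locale.getpreferredencoding())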

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because it introduces several additional variables (we would also need to correctly guess how your web browser handled the text, and how the web server handled it, and the tool you used to obtain a copy, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice (one way to produce such a dump is sketched below).
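Here is a minimal Python 3 sketch that produces a short hex dump and then tries a handful of candidate encodings against the same bytes; the file name mystery.txt and the candidate list are assumptions to adapt, not a definitive recipe:

    # Hex-dump the first 16 bytes of a file and attempt a few decodings.
    # (Python 3.8+ for the separator argument to bytes.hex().)
    with open("mystery.txt", "rb") as f:
        sample = f.read(16)

    print(sample.hex(" "))  # e.g. 9e 9f 9a a0 ...

    for encoding in ("utf-8", "cp1252", "iso-8859-1", "koi8_r", "cp437"):
        try:
            print(encoding, "->", repr(sample.decode(encoding)))
        except UnicodeDecodeError:
            print(encoding, "-> not valid in this encoding")

A candidate that decodes cleanly is not necessarily the right one, of course; the decoded text still has to make sense in the expected language.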

Common Questions


¹ When talking about encodings, hex representations are often used because they are more concise than writing out the bits -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII, as ANSI X3.4-1986) and some text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism by which one of them is selected.
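As a quick check of the arithmetic in footnote 1, a two-line Python sketch:

    # 0xE2 written out in binary is 11100010, and back again.
    print(format(0xE2, "08b"))   # prints 11100010
    print(hex(0b11100010))       # prints 0xe2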

See Also

15132 questions

import utf-8 mysqldump to latin1 database (5 votes, 1 answer)
I have a dump file of a phpnuke site somehow in utf8. I'm trying to reopen this site on a new server. But nuke uses latin1. I need a way to create a latin1 database using this utf-8 dump file. I tried everything I could think of. iconv, mysql… (asked by hctopcu)

How do I write special characters (0x80..0x9F) to the Windows console? (5 votes, 2 answers)
I would like to have this code: System.Console.Out.WriteLine ("œil"); display œil instead of oil as it does in my test program. The Console.OutputEncoding is set by default to Western European (DOS) (CodePage set to 850 and WindowsCodePage set to… (asked by Pierre Arnaud)

Eclipselink characterEncoding equivalence (5 votes, 1 answer)
I have a characterEncoding problem in JPA - EclipseLink. My project's persistent.xml is ; (asked by Rahman Usta)

Charset not present in used JDK (5 votes, 1 answer)
I have a java system that serves as a gateway to different systems (java, mainframe, etc). This java system receives a request using, for example, utf8 and converts it to the encoding of the target. The issue is that there is a… (asked by fnmps)

Convert Between Latin1-encoded Data.ByteString and Data.Text (5 votes, 1 answer)
Since the latin-1 (aka ISO-8859-1) character set is embedded in the Unicode character set as its lowest 256 code-points, I'd expect the conversion to be trivial, but I didn't see any latin-1 encoding conversion functions in Data.Text.Encoding which… (asked by hvr)

C++ doesn't convert the uppercase "I" character to lowercase correctly (5 votes, 3 answers)
I have this simple C++ code that converts uppercase characters to lowercase: #include #include #include #include #include int main() { std::wstring input_str = L"İiIı"; std::locale… (asked by user2401856)

Unknown characters (5 votes, 1 answer)
I read the string from file with encoding "UTF-8". And I need to match it to a expression. The first character of the file is #, but in the string the first is ''(empty symbol). I have translated it into bytes with charset "UTF-8", here it is [-17,… (asked by itun)

jQuery ajax special characters problem (5 votes, 4 answers)
Okay, so here is the problem: I have a form on my php page. When a user has entered a name and presses submit, a jQuery click event (on the submit button) collects the information and passes it on through $.ajax(). $.ajax({ url:… (asked by Thor A. Pedersen)

How to find charset of System.err if stdout is redirected? (5 votes, 1 answer)
Finding the charset of System.out is tricky. (See Logback System.err output uses wrong encoding for discussion and implications with Logback.) Here's what the System.out API documentation says. The "standard" output stream. This stream is already… (asked by Garret Wilson)

Newline control characters in multi-byte character sets (5 votes, 4 answers)
I have some Perl code that translates new-lines and line-feeds to a normalized form. The input text is Japanese, so that there will be multi-byte characters. Is it still possible to do this transformation on a byte-by-byte basis (which I think it… (asked by Thilo)

Detecting non-ASCII characters in Rails (5 votes, 2 answers)
I am wondering if there's a way to detect non-ASCII characters in Rails. I have read that Rails does not use Unicode by default, and characters like Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these… (asked by gerky)

std::filesystem::path::u8string might not return valid UTF-8? (5 votes, 1 answer)
Consider this code, running on a Linux system (Compiler Explorer link): #include #include int main() { try { const char8_t bad_path[] = {0xf0, u8'a', 0}; // invalid utf-8, 0xf0 expects continuation bytes … (asked by user4520)

Solr vs document encoding problems (5 votes, 1 answer)
I am using solrj 1.4. My solrj doesn't index properly the documents in utf-16 encoding. I guess when it tries to convert to unicode, it replaces the problematic utf-16 surrogate keys with unicode replaceable character U+FFFD. Can anyone guide me on… (asked by user911084)

Problems with UTF-8 encoding, JSP, jQuery, Spring (5 votes, 2 answers)
I have a web app with Spring, JSP and jQuery in an Apache Tomcat 6; one JSP page has a form that sends the data with an ajax call made with jQuery, to a Spring MultiActionController on my back end. The problem is with the UTF-8 strings in the form… (asked by h0m3r16)

Firefox not displaying CP437 (5 votes, 3 answers)
I am developing an application with a Web interface, that is connecting up to an old Cobol mainframe, that uses CP437. We only have one system to communicate with, so if possible I would rather not do any charset conversions, and just use CP437… (asked by asc99c)