Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as the letter a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰  (ending in a no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
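
The three interpretations of that byte sequence can be checked directly. The following is a minimal Python sketch; the variable names are illustrative, and the codec names (cp1252, koi8_r, utf-8) are the ones Python's standard library uses for these encodings.

```python
# The same three bytes, interpreted under three different encodings.
raw = bytes([0xE2, 0x89, 0xA0])

as_cp1252 = raw.decode("cp1252")  # three characters: â, ‰ and a no-break space
as_koi8r = raw.decode("koi8_r")   # three characters: Б, ┴ and ═
as_utf8 = raw.decode("utf-8")     # one character: ≠ (U+2260)

for name, text in [("cp1252", as_cp1252), ("koi8_r", as_koi8r), ("utf-8", as_utf8)]:
    # repr() makes the otherwise invisible no-break space show up as \xa0
    print(f"{name:8} {len(text)} char(s): {text!r}")
```

The bytes never change; only the table used to look them up does.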

A useful reference is Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, and character encoding is not meaningful or well-defined for it. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary. A common beginner problem is trying to read such a binary file as text and being surprised by a character encoding error; the fix in this situation is to read the file in binary mode instead.
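
As a rough illustration of the binary-versus-text distinction, the Python sketch below writes the 8-byte PNG file signature to a temporary file (a stand-in for any binary file; the temporary file and variable names are assumptions made for this example) and then opens it both ways.

```python
import os
import tempfile

# Stand-in for a binary file: the 8-byte signature that starts every PNG file.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(png_signature)
    path = tmp.name

try:
    # Text mode tries to decode the bytes as UTF-8 and fails on 0x89.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode returns the bytes untouched; no encoding is involved.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

print(text_mode_failed, data)
```

Picking a different text encoding would not make the first read meaningful; the data simply is not text.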

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Alternatively, if you know the current encoding and the one you want to convert to, try a tool like iconv or GNU recode. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot represent text which contains Chinese or Russian characters, emoji, etc.
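
If neither an editor nor a command-line converter is at hand, the same conversion can be sketched in a few lines of Python. This is an illustrative example, not a replacement for those tools: convert_encoding and the file names are hypothetical, and the source encoding (cp1252 here) must already be known.

```python
import os
import tempfile

def convert_encoding(src_path, dst_path, src_encoding, dst_encoding):
    """Re-encode a text file; raises UnicodeError if a character does not fit."""
    # newline="" keeps the original line endings exactly as they were
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)

# Demonstration in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "legacy.txt")
    dst = os.path.join(workdir, "converted.txt")
    with open(src, "wb") as f:
        f.write(b"caf\xe9\n")  # "café" in Windows code page 1252
    convert_encoding(src, dst, "cp1252", "utf-8")
    with open(dst, "rb") as f:
        converted = f.read()   # the é is now the two bytes C3 A9

# Code page 1252 has no slot for Cyrillic letters, so this direction fails.
try:
    "Привет".encode("cp1252")
    cyrillic_fits = True
except UnicodeEncodeError:
    cyrillic_fits = False

print(converted, cyrillic_fits)
```

The failing second half shows why the target encoding matters: converting to UTF-8 always works for valid text, while converting to a legacy code page only works if every character has a slot in it.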

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file is shown below."

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
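
One way to find out what your system actually defaults to, instead of guessing, is to ask a tool you already have. The Python sketch below is a minimal example under the assumption that Python's view of the locale matches the program you are debugging, which is not guaranteed.

```python
import codecs
import locale

# The encoding Python reports as the default for text in the current locale.
# On Western Windows installations this is often "cp1252".
preferred = locale.getpreferredencoding(False)

# Normalize to the canonical codec name Python knows it by.
canonical = codecs.lookup(preferred).name

print(f"locale default encoding: {preferred} (codec: {canonical})")
```

Quoting the reported name in a question is more useful than saying "ANSI" or "the default".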

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
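
If you do not have a hex dump tool at hand, a few lines of Python can produce one. This is a small sketch; hex_dump is a hypothetical helper written for this example, and its output layout (offset, then two groups of eight bytes) is just one common convention.

```python
def hex_dump(data, width=16, max_lines=4):
    """Format the first few rows of data as an offset plus space-separated hex bytes."""
    half = width // 2
    lines = []
    for offset in range(0, min(len(data), width * max_lines), width):
        chunk = data[offset:offset + width]
        left = " ".join(f"{b:02x}" for b in chunk[:half])
        right = " ".join(f"{b:02x}" for b in chunk[half:])
        lines.append(f"{offset:06x} {left}  {right}".rstrip())
    return "\n".join(lines)

# Sample bytes for demonstration only.
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

For a real file, read the first few dozen bytes in binary mode, for example with open(path, "rb") and f.read(64), and pass the result to the helper.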


¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII, as ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

15132 questions

5 votes, 2 answers

C#, UTF-8 and encoding characters

This is a shot-in-the-dark, and I apologize in advance if this question sounds like the ramblings of a madman. As part of an integration with a third party, I need to UTF8-encode some string info using C# so I can send it to the target server via…
Mass Dot Net

5 votes, 1 answer

Java: Runtime.getRuntime().exec() passes arguments in unicode when it shouldn't

My problem is best explained in an example: The following program is run on a Linux system that is not in Unicode mode yet, but in ISO-8859-15. The environment is set as follows: LC_CTYPE=de_DE@euro import java.io.*; import java.util.*; public…
Erich Kitzmueller

5 votes, 2 answers

Convert unicode to Chinese characters

Supposing I have a string of code like so: \u00e5\u00b1\u00b1\u00e4\u00b8\u008a\u00e7\u009a\u0084\u00e4\u00ba\u00ba How would I convert these back into Chinese characters using Javascript: 山上的人 This is so that I can actually display Chinese on my…
dthree

5 votes, 1 answer

What are the ramifications of storing UTF8 text in a Latin1 database?

I have a mysql database in default charset latin1 mysql> SELECT SCHEMA_NAME 'database', default_character_set_name 'charset', DEFAULT_COLLATION_NAME 'collation' FROM information_schema.SCHEMATA…
david_adler

5 votes, 1 answer

WebClient.DownloadString uses wrong encoding

I'm downloading XML files from sharepoint online using webclient. However, when I use WebClient.DownloadString(string url) method, some characters are not correctly decoded. When I use WebClient.DownloadFile(string url, string file) and then I read…
Liero

5 votes, 3 answers

How to find out charset of text file loaded by input[type="file"] in Javascript

I want to read user's file and gave him modified version of this file. I use input with type file to get text file, but how I can get charset of loaded file, because in different cases it can be various... Uploaded file has format .txt or something…

5 votes, 2 answers

Best way to translate UTF-8 to ISO8859-1 in Go

I'm trying to map UTF-8 characters to their "similar" ISO8859-1 representation. Removing diacritics, but also replacing characters like Ł with L or ı with i. Example: José Kakışır should become Jose Kakisir. I'm aware that removing diacritics can…
derFunk

5 votes, 2 answers

JavaScript - How do I convert unicode characters? English numbers to Persian numbers

I'm building a software that takes integers from users and does some calculations and then outputs the result. The thing is that I want to take users numbers using English numbers(0, 1, 2, etc.) and I want to present the numbers using Persian…
Amirhosein Al

5 votes, 2 answers

What does character encoding in C programming language depend on?

What does character encoding in C programming language depend on? (OS? compiler? or editor?) I'm working on not only characters of ASCII but also ones of other encodings such as UTF-8. How can we check the current character encodings in C?
mallea

5 votes, 1 answer

Python 3.x requests redirect with unicode character

I am trying to get the following URL with requests.get() in Python 3.x: http://www.finanzen.net/suchergebnis.asp?strSuchString=DE0005933931 (this URL consists of a base URL with the search string DE0005933931). The request gets redirected (via HTTP…
bastelflp

5 votes, 1 answer

Python encodes (Korean) characters in an unexpected way with euc-kr encoding (codecs, encodings module)

I tried to read some Korean text file encoded in 'euc-kr' in python but had some errors raised. After inspecting encodings module for a while, I learned that this module encodes Korean characters seemingly very weird way. Let me take an…
user5538922

5 votes, 1 answer

Python 3: CSV utf-8 encoding

I'm trying to write a CSV with non-ascii character using Python 3. import csv with open('sample.csv', 'w', newline='', encoding='utf-8') as csvfile: spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|',…
user1187968

5 votes, 2 answers

How to change Project Character Set in JetBrains Clion

I can change Project Character Set to Unicode or to Multi-Byte in Microsoft visual studio like what is shown in the picture. But, is the same thing possible in clion?
ilw

5 votes, 3 answers

Encode to single byte extended ascii values

In C# is there a way to encode the extended ascii values (128-255) into their single byte values as shown here: http://asciitable.com/ I've tried using Encoding.UTF8.GetBytes() but that returns multi byte values for the extended codes. I don't need…
Adam Haile

5 votes, 1 answer

Sending Hebrew subject in php mail goes Klingon...?

I'm trying to send email with hebrew content/subject like so: $to = 'email@email.com'; $subject = "איזה יום יפה היום"; $message = 'ממש יום יפה'; $headers = 'From: email@email.com' . "\r\n"; $headers .= 'MIME-Version: 1.0' . "\r\n"; $headers .=…
Gal