Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as the letter a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ (followed by a no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
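The three readings of the bytes 0xE2 0x89 0xA0 can be checked with a minimal Python sketch. The codec names `cp1252`, `koi8_r`, and `utf-8` are Python's own spellings; the variable names are only illustrative.

```python
# The three bytes from the example above.
raw = bytes([0xE2, 0x89, 0xA0])

# Decode the same bytes under three different encodings.
as_cp1252 = raw.decode("cp1252")  # three characters: â, ‰, no-break space
as_koi8_r = raw.decode("koi8_r")  # three characters: Б, ┴, ═
as_utf_8 = raw.decode("utf-8")    # one character: ≠

for name, text in [("cp1252", as_cp1252), ("koi8_r", as_koi8_r), ("utf-8", as_utf_8)]:
    # Print code points rather than glyphs so the result does not depend on
    # the terminal's own encoding.
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{name:7} {len(text)} character(s): {code_points}")
```

The bytes never change; only the table used to interpret them does.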

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, so character encoding is not meaningful or well-defined for it. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary. A common beginner problem is trying to read such a file as text and being surprised by a character encoding error; the fix is to read the file in binary mode instead.
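As a rough illustration in Python, the sketch below writes the eight-byte PNG signature to a temporary file (the file name `example.png` is hypothetical), then reads it back first in text mode and then in binary mode.

```python
import os
import tempfile

# The first eight bytes of every PNG image; 0x89 cannot start a UTF-8 sequence.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.png")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(PNG_SIGNATURE)

    # Text mode tries to decode the bytes as characters and fails.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError as exc:
        text_mode_failed = True
        print("text mode:", exc)

    # Binary mode returns the bytes untouched; no decoding step is involved.
    with open(path, "rb") as f:
        data = f.read()
    print("binary mode:", data)
```

Picking a different text encoding would not help here, because the file never contained encoded characters in the first place.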

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot store text which contains Chinese or Russian characters, emoji, etc. Alternatively, if you know the current encoding and the one you want to convert to, try a tool like iconv or GNU recode.
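If neither tool is at hand, the same conversion can be sketched in a few lines of Python. The helper name `transcode` is made up for this example; it decodes from the current encoding and re-encodes in the target, which is roughly what a command-line converter does.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


def as_hex(data: bytes) -> str:
    """Render bytes as space-separated two-digit hex values."""
    return " ".join(f"{byte:02x}" for byte in data)


# "café" stored as Windows code page 1252, converted to UTF-8.
cp1252_bytes = "café".encode("cp1252")
utf8_bytes = transcode(cp1252_bytes, "cp1252", "utf-8")
print(as_hex(cp1252_bytes), "->", as_hex(utf8_bytes))

# Going the other way fails for text the target encoding cannot represent.
try:
    transcode("Привет".encode("utf-8"), "utf-8", "cp1252")
    russian_fits = True
except UnicodeEncodeError as exc:
    russian_fits = False
    print("cannot save as cp1252:", exc)
```

The second call fails because code page 1252 has no Cyrillic letters, which is the limitation described above.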

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context.²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
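If Python is available, one way to find out what your system treats as the locale default, instead of guessing, is to ask the `locale` module. This is only a sketch: the result differs between machines, and other tools on the same system may use a different default.

```python
import codecs
import locale

# The preferred encoding for the current locale. On Windows this is typically
# the "ANSI" code page, for example cp1252 on Western installations.
preferred = locale.getpreferredencoding(False)

# Normalize to Python's canonical codec name so that aliases compare equal.
canonical = codecs.lookup(preferred).name
print(f"locale default: {preferred} (codec: {canonical})")
```

Quoting this value in a question removes one of the things readers would otherwise have to guess.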

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
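If you have no hex dump tool installed, a few lines of Python can produce one. This is a minimal sketch; `hex_dump` is a made-up helper name, and the sample bytes are the ones from the example dump above.

```python
def hex_dump(data: bytes, width: int = 16, max_lines: int = 4) -> str:
    """Format the first few lines of data as an offset plus hex byte values."""
    lines = []
    for offset in range(0, min(len(data), width * max_lines), width):
        chunk = data[offset:offset + width]
        hex_bytes = " ".join(f"{byte:02x}" for byte in chunk)
        lines.append(f"{offset:06x} {hex_bytes}")
    return "\n".join(lines)


# The sample bytes from the hex dump shown earlier.
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

For a real file, read the first bytes in binary mode, for example `open(path, "rb").read(64)`, and pass them to the function.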

1 When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

2 The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

15132 questions
6 votes, 3 answers

write special characters into excel table by python package pyExcelerator/xlwt

Task: I generate formated excel tables from csv-files by using the python package pyExcelerator (comparable with xlwt). I need to be able to write less-than-or-equal-to (≤) and greater-than-or-equal-to (≥) signs. So far: I can save my table as…
SimonSalman • 351 • 1 • 6 • 13

6 votes, 2 answers

Where can I specify my javadoc document charset?

I'm writing javadoc in Polish language and I want to define UTF-8 charset for my javadoc documentation generated by eclipse - how and where can I do that?
pawel • 5,976 • 15 • 46 • 68

6 votes, 2 answers

how to write with a single byte character encoding?

I have a webservice that returns the config file to a low level hardware device. The manufacturer of this device tells me he only supports single byte charactersets for this config file. On this wiki page I found out that the following should be…
Sjors Miltenburg • 2,540 • 4 • 33 • 60

6 votes, 1 answer

Superscript character in PHP causing a MySQLi select query to find 0 rows

I am using PHP 5.3.3 and MySQL 5.1.61. The column in question is using UTF-8 encoding and the PHP file is encoded in UTF-8 without BOM. When doing a MySQLi query with a ² character in SQLyog on Windows, the query executes properly and the correct…
Kevin Ghadyani • 6,829 • 6 • 44 • 62

6 votes, 4 answers

getBytes() With UTF-8 Doesn't Work for Upper-Case German Umlauts

For development I'm using ResourceBundle to read a UTF-8 encoded properties-file (I set that in Eclipse' file properties on that file) directly from my resources-directory in the IDE (native2ascii is used on the way to production),…
sjngm • 12,423 • 14 • 84 • 114

6 votes, 1 answer

Nokogiri fails outputting XML with UTF-16 declaration (understanding and working around)

Summary Attempting to read and serialize XML documents that have a UTF-16 encoding and declaration causes Nokogiri to produce garbage after a certain point. Is this a bug, or is there a reasonable explanation for this? What's the best way to avoid…
Phrogz • 296,393 • 112 • 651 • 745

6 votes, 2 answers

list of garbage characters like â€™

I am using librets to retrieve data from my RETS Server. Somehow librets Encoding method is not working and I am receiving some weird characters in my output. I noticed characters like '’' is replaced with â€™. I am unable to find a fix for librets…
ZafarYousafi • 8,640 • 5 • 33 • 39

6 votes, 1 answer

How to correctly decode text files from FileSystemReadStream in Pharo 1.4

In Pharo 1.4 i opened a FileSystemReadStream on a text file and transformed it to a String via aFileSystemReadStream contents asString. My text files are UTF8 encoded and have those Windows (CR LF) linebreaks. The resulting Pharo Strings have two…
Helene Bilbo • 1,142 • 7 • 20

6 votes, 2 answers

Spring MVC: CharacterEncodingFilter; why only set response encoding by force?

I was having a look at the CharacterEncodingFilter provided by Spring MVC. I was wondering why it was only possible to set the response encoding when the request encoding was forced to the given encoding? Why not be able to set a default response…
Martin Becker • 3,331 • 3 • 22 • 25

6 votes, 4 answers

MySQL European Characters

I can't figure this out for the life of me. I have a query that pulls translations of elements on a page. So any number of 15 languages can appear on that page. When I start to add languages like Swedish anything that has a symbol such as ö results…
Peter • 3,144 • 11 • 37 • 56

6 votes, 1 answer

What does this mojibake/krakozyabry on The Simpsons say?

On Season 12 Episode 07 "The Great Money Caper" of The Simpsons, I noticed a few years ago "gibberish" signs on the Russian spaceship. Randomly today, I decided to search and see if anyone decoded them but couldn't find any results. I suspect that…
chfoo • 386 • 3 • 13

6 votes, 2 answers

PHP / MySQL - Safe characters for display names / usernames / passwords, with PDO

a bit of a PHP / MySQL newbie here... I've been building a PHP-based site that uses a MySQL database for storing user information, like their display names, usernames, and passwords. I've been learning about escaping, prepared statements and the…
Jackson • 9,188 • 6 • 52 • 77

6 votes, 3 answers

Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s, which include a significant ammount of Korean text. The HTML lacks character set metadata, so of course all of the Korean text now does not render properly. The following examples will all…
dongle • 599 • 1 • 4 • 17

6 votes, 1 answer

Get Encoding fails when I build Monodroid project with unshared runtime

I am trying to use the DotNetZip library in my Monodroid app. Everything seems to work fine when I enable the Shared Runtime build option. When I disable Shared Runtime, the line static System.Text.Encoding ibm437 =…
Ash • 400 • 3 • 9

6 votes, 3 answers

Can there be 2 different UTF-8 encodings for the same character?

I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1). All works fine, except I sometimes get strange encodings for some umlaut characters. For example the Latin 1 E with 2 dots (0xEB) usually comes as UTF-8…
Gene Vincent • 5,237 • 9 • 50 • 86