Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as the letter a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes[1] 0xE2 0x89 0xA0 could represent the text â‰ (followed by a no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
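As a minimal sketch of that example, the snippet below decodes the same three bytes with Python's built-in `cp1252`, `koi8_r`, and `utf-8` codecs. The variable names are illustrative only.

```python
# The same three bytes, interpreted under three different encodings.
raw = bytes([0xE2, 0x89, 0xA0])

as_cp1252 = raw.decode("cp1252")  # three characters: U+00E2, U+2030, U+00A0
as_koi8r = raw.decode("koi8_r")   # three characters: U+0411, U+2534, U+2550
as_utf8 = raw.decode("utf-8")     # one character: U+2260

for label, text in [("cp1252", as_cp1252), ("koi8_r", as_koi8r), ("utf-8", as_utf8)]:
    # Print the code points too, since invisible characters such as
    # the no-break space (U+00A0) are easy to miss on screen.
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label:7} {len(text)} character(s): {code_points}")
```

The bytes never change; only the table used to map them to characters does.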

A useful reference is Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, and character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error. The fix in this situation is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary files.
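The difference between text mode and binary mode can be sketched as follows. The example writes the 8-byte PNG file signature to a temporary file as a stand-in for "some binary file", then reads it both ways; the file and variable names are assumptions made for illustration.

```python
import os
import tempfile

# Write a few bytes that are not text: the 8-byte PNG file signature.
png_signature = b"\x89PNG\r\n\x1a\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as tmp:
    tmp.write(png_signature)
    path = tmp.name

try:
    # Text mode asks Python to decode the bytes, which fails here
    # because 0x89 cannot start a UTF-8 sequence.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode returns the bytes untouched; no encoding is involved.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

print("text mode failed with:", text_mode_error)
print("binary mode returned:", data)
```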

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Alternatively, if you know the current encoding and the encoding you want to convert to, try a tool like iconv or GNU recode. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot represent text which contains Chinese or Russian characters, emoji, etc.
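If neither an editor nor iconv is at hand, the same conversion can be sketched in a few lines of Python. The helper `convert_encoding` and the file names below are hypothetical; the sketch assumes you already know the source encoding, which is the hard part in practice.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, from_encoding, to_encoding):
    """Re-encode a text file, roughly like `iconv -f FROM -t TO`."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=from_encoding, newline="") as src:
        text = src.read()
    # Raises UnicodeEncodeError if the target encoding lacks a character.
    with open(dst_path, "w", encoding=to_encoding, newline="") as dst:
        dst.write(text)


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.txt")
dst_path = os.path.join(workdir, "out.txt")

# Create a small sample file encoded in Windows code page 1252.
with open(src_path, "wb") as f:
    f.write("caf\u00e9\n".encode("cp1252"))

convert_encoding(src_path, dst_path, "cp1252", "utf-8")
with open(dst_path, "rb") as f:
    converted = f.read()
print(converted)

# Code page 1252 has no Cyrillic letters, so this text cannot be encoded.
try:
    "\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("Russian text fits in cp1252:", encodable)
```

Converting to UTF-8 always succeeds for valid input, while converting to a legacy code page can fail as shown in the last step.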

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context[2]

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
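If you are unsure which default your system uses, one way to find out is to ask the runtime instead of guessing. The following sketch uses Python's standard `locale` and `sys` modules; on Windows the locale encoding is typically the "ANSI" code page (for example cp1252), unless Python's UTF-8 mode is enabled.

```python
import locale
import sys

# The encoding the current locale prefers for text files.
locale_encoding = locale.getpreferredencoding(False)

print("locale encoding:    ", locale_encoding)
print("stdout encoding:    ", sys.stdout.encoding)
print("filesystem encoding:", sys.getfilesystemencoding())
```

Including this output in a question removes one of the variables others would otherwise have to guess.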

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
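A short hex dump like the one requested above can be produced with a few lines of Python. The function name `hex_dump` and its parameters are illustrative; the sample bytes are the ones from the example question.

```python
def hex_dump(data, width=16, max_lines=4):
    """Return a short hex dump: offset, then up to `width` bytes per line."""
    lines = []
    for offset in range(0, min(len(data), width * max_lines), width):
        chunk = data[offset:offset + width]
        hex_bytes = " ".join(f"{byte:02x}" for byte in chunk)
        lines.append(f"{offset:06x} {hex_bytes}")
    return "\n".join(lines)


# For a real file, read the first bytes in binary mode, for example:
#     with open(path, "rb") as f: sample = f.read(64)
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

The `max_lines` limit keeps the output to a few lines, which is usually all a question needs.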


[1] When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

[2] The American National Standards Institute has standardized some character sets (notably ASCII, as ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

15132 questions
5 votes, 1 answer

How to read a file with long file name with unicode in Strawberry perl not using Win32::Unicode::File?

I have a file located in a directory, with Danish characters in it, on a Windows XP machine. I use Strawberry Perl and would like to read this file. The following code works fine: use Win32::Unicode::File; # Some code left out.... $fname…
Dr. Mike
5 votes, 2 answers

Producing symbols from HTML characters in FPDF

I have a government client that requires the legal 'section symbol' (§) in their documents. When creating the documents in a web page, this symbol is created with &sect; or &#167;. I cannot figure out how to get either of these to work in a pdf…
seveninstl
5 votes, 2 answers

Changing character encoding

I am having problems changing the encoding on a text file in Ruby 1.9.2p290. I am getting the error invalid byte sequence in UTF-8 (ArgumentError). The problem (I think) lies in the fact that the charset seems to be unknown. From the command…
thilton
5 votes, 1 answer

htmlspecialchars ampersand

__('Details & Documents') ?>

The above prints out as: Details &amp; Documents. What is the proper syntax so that it prints as: Details & Documents? Thanks
vulgarbulgar
5 votes, 2 answers

HTML special character decoding

Using Java on Android I'm struggling to convert a couple of html special characters. So far I've tried: String myString = "%A32.00%20per%20month%B3"; Html.fromHtml(myString).toString(); => %A32.00%20per%20month%B3 URLDecoder.decode(myString) =>…
scottyab
5 votes, 4 answers

Is it significantly better to use ISO-8859-1 rather than UTF-8 wherever possible?

For globalization of scripts, it is very common to use UTF-8 as the default charset, for example in HTML or as the default charset of MySQL. This is also the case for Latin websites whose characters all fall within ISO-8859-1. Isn't it advantageous…
Googlebot
5 votes, 10 answers

How to convert large UTF-8 strings into ASCII?

I need to convert large UTF-8 strings into ASCII. It should be reversible, and ideally a quick/lightweight algorithm. How can I do this? I need the source code (using loops) or the JavaScript code. (should not be dependent on any…
Robin Rodricks
5 votes, 4 answers

How do I display non-english characters in python?

I have a Python dictionary which contains items that have non-English characters. When I print the dictionary, the Python shell does not properly display the non-English characters. How can I fix this?
alwbtc
5 votes, 8 answers

PHP character encoding problems

I need help with a character encoding problem that I want to sort once and for all. Here is an example of some content which I pull from a XML feed, insert into my database and then pull out. As you can not see, a lot of special html characters get…
James
5 votes, 1 answer

Fix PDF encoding

I have Arabic PDF files, and it seems that there is something wrong with their encoding. When I try to search the PDF for a word inside it, it doesn't find any results. When I try to export the PDF contents to Excel using other programs, it exports data in a…
M_1100
5 votes, 2 answers

Italic greek letters / latex-style math in plot titles

I'd like to create latex-style math in plot titles in R. The plotmath tools have a useful but limited subset of expressions they can display, and use non-latex syntax and style. For instance, I would like to print the equation $\mathrm{d}…
cboettig
5 votes, 2 answers

Codeigniter and charsets

I haven't been using CodeIgniter for long, but I have some charset problems. I'm asking around at the CI Forum, but I want to go further, as there is still no global solution: http://codeigniter.com/forums/viewthread/204409/ The problem was a database error 1064.…
Roy
5 votes, 2 answers

How to print [Simplified] Chinese characters to Eclipse console?

I have the following code: import java.io.PrintStream; import java.io.UnsupportedEncodingException; import java.util.Locale; public final class ChineseCharacterDemo { public static void main(String[] args) throws UnsupportedEncodingException…
mre
5 votes, 2 answers

How do I set character encoding for Oracle 10g with JDBC

I am using Java and Oracle 10g database. How can I specify the character encoding like UTF-8 for the Oracle database with JDBC? And how can I find out the current encoding used by JDBC?
Jack
5 votes, 1 answer

adding a char encoding to ruby 1.9.x?

If one wanted to add a new char encoding to 1.9.x, supported just the same as the built-in encodings, how would you go about doing it? Can you do it with code in ruby, or would it require a C patch in MRI? (I don't think it matters, but I am…
jrochkind