Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols such as a or 汉) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.
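As a minimal sketch of the example above, the following Python snippet decodes the same three bytes under each of the three encodings mentioned. The codec names (cp1252, koi8_r, utf-8) are the ones Python's standard library uses; the variable names are only illustrative.

```python
# The same three bytes, interpreted under three different encodings.
raw = bytes([0xE2, 0x89, 0xA0])

interpretations = {
    codec: raw.decode(codec)
    for codec in ("cp1252", "koi8_r", "utf-8")
}

for codec, text in interpretations.items():
    # repr() makes invisible characters (such as a no-break space) visible
    print(f"{codec:>7}: {text!r}")
```

The bytes never change; only the table used to map them to characters does.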

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, and character encoding is not meaningful or well-defined for it. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary. A common beginner problem is trying to read such a binary file as text and being surprised by a character encoding error; the fix in this situation is to read the file in binary mode instead.
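A small sketch of that beginner problem, assuming Python 3: the 8-byte PNG signature stands in for an arbitrary binary file, and the temporary file exists only for the demonstration. Reading it as UTF-8 text fails, while binary mode returns the bytes untouched.

```python
import os
import tempfile

# The 8-byte PNG signature stands in for "some binary file" here.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as tmp:
    tmp.write(PNG_SIGNATURE)
    path = tmp.name

try:
    # Text mode: Python tries to decode the bytes as UTF-8 and fails.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: no decoding is attempted, so the bytes come back as-is.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

print(text_mode_failed, data)
```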

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Alternatively, if you know the current encoding and the encoding you want to convert to, try a tool like iconv or GNU recode. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot represent text which contains Chinese or Russian characters, emoji, etc.
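For readers who would rather script the conversion, here is a rough Python sketch of the same idea: decode from the known source encoding, then encode to the target. The helper name convert_encoding and the sample strings are made up for illustration; this is not how iconv or recode are implemented.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)

# "café" stored as Windows-1252 bytes, converted to UTF-8.
cp1252_bytes = "café".encode("cp1252")
utf8_bytes = convert_encoding(cp1252_bytes, "cp1252", "utf-8")

# Not every encoding can hold every character: code page 1252 has no 汉.
try:
    "汉".encode("cp1252")
    han_fits_cp1252 = True
except UnicodeEncodeError:
    han_fits_cp1252 = False

print(cp1252_bytes, utf8_bytes, han_fits_cp1252)
```

The conversion only gives sensible results if the source encoding is actually correct; decoding with the wrong one silently produces the wrong characters.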

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
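If you are unsure which default your system is using, one way to find out, assuming Python is available, is to ask the standard library's locale module. The result depends entirely on the machine and its settings, so treat this as a sketch for discovering the name to quote in your question.

```python
import codecs
import locale

# The encoding Python reports as the preferred one for the current locale.
preferred = locale.getpreferredencoding(False)

# Normalise it to Python's canonical codec name so it can be quoted precisely.
canonical = codecs.lookup(preferred).name

print(f"locale default: {preferred} (codec: {canonical})")
```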

Notice:

We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.
A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).
If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.
A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
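If you do not have a hex dump tool at hand, a few lines of Python can produce one. This is a minimal sketch: the function name hex_dump and its parameters are illustrative, and the sample bytes are the ones from the example dump above.

```python
def hex_dump(data: bytes, width: int = 16, max_lines: int = 4) -> str:
    """Format the first few lines of data as an offset plus hex bytes."""
    lines = []
    for offset in range(0, min(len(data), width * max_lines), width):
        chunk = data[offset:offset + width]
        # bytes.hex(" ") inserts a separator between bytes (Python 3.8+)
        lines.append(f"{offset:06x} {chunk.hex(' ')}")
    return "\n".join(lines)

sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

For a real file, read the bytes in binary mode first, for example open(path, "rb").read(64), and pass them to the function.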

Common Questions

¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

wchar_t and encoding

If I want to convert a string to UTF-16, say char * xmlbuffer, do I have to convert the type to wchar_t * before encoding to UTF-16? And is the char* type required before encoding to UTF-8? How are wchar_t and char related to UTF-8 or UTF-16 or…

c++ character-encoding wchar-t

asked May 03 '12 at 21:38

Hunter

votes

3 answers

Reading non-ASCII characters from a text file

I'm using Python 2.7. I've tried many things like codecs, but they didn't work. How can I fix this? myfile.txt wörd My code f = open('myfile.txt','r') for line in f: print line f.close() Output s\xc3\xb6zc\xc3\xbck Output is same on eclipse and…

python character-encoding

asked Apr 29 '12 at 23:30

Rckt

votes

2 answers

ASMX Web Service using wrong encoding on incoming request

My .NET ASMX webservice is accepting requests from a client I don't have direct control over. It's sending a request that looks like this: POST /Service.asmx HTTP/1.1 Connection: Keep-Alive Pragma: no-cache Content-Length: 1382 Content-Type:…

asp.net web-services encoding character-encoding asmx

asked Apr 19 '12 at 02:55

pettys

votes

2 answers

ColdFusion: convert accented regional characters to plain ASCII

I need to convert characters in French, Swedish and other languages to their "normal" standard ASCII form. I don't know how to explain, here's an example: ç -> c ò -> o ... In bash on Unix I would use iconv. How can I do this in ColdFusion9 / Java?

utf-8 coldfusion character-encoding iconv diacritics

asked Mar 29 '12 at 13:38

Fabio B.

votes

3 answers

How to encode cyrillic characters for URL and then decode them?

I have a form on one page:

perl utf-8 character-encoding utf8-decode

asked Mar 22 '12 at 08:55

goe

votes

1 answer

Encoding that minimizes misreading / mistyping / misspeaking?

Let's say you have a system in which a fairly long key value can be accurately communicated to a user on-screen, via email or via paper; but the user needs to be able to communicate the key back to you accurately by reading it over the phone, or by…

encoding character-encoding error-correction human-readable

asked Mar 09 '12 at 18:36

Chris Johnson

votes

1 answer

What advantage is there to using UTF-8 over UTF-16?

Possible Duplicate: UTF8, UTF16, and UTF32 I am always reading things saying to write my source code in UTF-8 and stay away from other encodings, but it also seems like UTF-16 is an improved version of UTF-8. What is the difference between them,…

encoding utf-8 character-encoding utf-16

asked Mar 07 '12 at 00:25

Orcris

votes

2 answers

Unable to set DecoderFallback property of an Encoding type

I'm attempting to set the DecoderFallback property of an arbitrary (but supported) encoding in my C# app. Essentially what I'm trying to do is this: ASCIIEncoding ascii = new ASCIIEncoding(); ascii.DecoderFallback = new…

c# character-encoding

asked Jun 04 '09 at 21:52

Brian Sweeney

votes

2 answers

Compression algorithm that produces url safe data

I'm looking to store cookie data in a compact form. Is there such a thing as a compression algorithm that produces URL safe output? Currently my approach is String jsonData = GSON.toJson(data); byte[] cookieBinaryData =…

algorithm cookies character-encoding compression urlencode

asked Feb 29 '12 at 15:08

Maxim Veksler

votes

2 answers

Charset of JSP tags

Simple question about charset of JSP tags. <%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%> <%@taglib tagdir="/WEB-INF/tags" prefix="custom" %> mytag is simple .tag file…

jsp character-encoding jsp-tags

asked Feb 11 '12 at 18:10

user12384512

votes

1 answer

What should be the default encoding for an API which reads from an URL using the file: protocol?

I'm designing an API which takes an URL as an input, and reads the content at that URL. When the URL is a "file:" protocol, what would make a better default for the character encoding? the system's native encoding UTF-8 The API allows this to be…

file url character-encoding api-design

asked Feb 07 '12 at 17:32

Matthew Simoneau

votes

3 answers

Grails request parameters encoding issue in Tomcat

My grails app will not decode request parameters correctly. In config.groovy: grails.views.gsp.encoding = "UTF-8" grails.converters.encoding = "UTF-8" All my gsp's use contentType="text/html; charset=UTF-8" on the page directive as well as

tomcat grails character-encoding grails-orm

asked Feb 07 '12 at 16:04

Lefteris Laskaridis

votes

4 answers

rails, wicked-pdf gem and é à ö characters showing incorrectly

When I generate a PDF with text containing characters such as é è à and so on I do get funny characters instead. I know this must be related to encoding. I did try force_encoding("UTF-8") on the string with those characters with no success. joel

ruby-on-rails character-encoding wicked-pdf

asked Feb 05 '12 at 23:40

zabumba

votes

1 answer

binary-to-text encoding, non-printing characters, protocol buffers, mongodb and bson

I have a candidate key (mongodb candidate key, __id) that looks like the following in protocol buffers : message qrs_signature { required uint32 region_id = 1; repeated fixed32 urls = 2; }; Naturally I can't use a protocol buffers encoded…

c++ mongodb character-encoding protocol-buffers bson

asked Jan 24 '12 at 14:50

Hassan Syed

votes

3 answers

need help in jquery uniformjs select menu '&' character encoding

I am using uniformjs form controls, which are working except for the list menu. When I add the '&' symbol (&) in the list menu, it renders correctly, but the problem comes when I change the value to a different value and then select again the value which has the & symbol…

javascript jquery listview character-encoding uniform

asked Jan 11 '12 at 07:31

Ravi

