Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as the letter a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ (followed by an invisible no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
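The three-byte example can be checked directly. Here is a minimal sketch in Python, assuming only its built-in `cp1252`, `koi8_r`, and `utf-8` codecs; the variable names are made up for illustration.

```python
# The same three bytes, interpreted under three different encodings.
raw = bytes([0xE2, 0x89, 0xA0])

as_cp1252 = raw.decode("cp1252")  # three characters, the last a no-break space
as_koi8r = raw.decode("koi8_r")   # a Cyrillic letter and two box-drawing characters
as_utf8 = raw.decode("utf-8")     # one character, U+2260 NOT EQUAL TO

for name, text in [("cp1252", as_cp1252), ("koi8_r", as_koi8r), ("utf-8", as_utf8)]:
    print(f"{name:>7}: {text!r} ({len(text)} character(s))")
```

The bytes never change; only the decoding table applied to them does.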

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, so character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error. The fix in this situation is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary.
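As an illustration, here is a small Python sketch of that beginner problem and its fix. It writes the eight-byte PNG signature to a temporary file (a stand-in for any binary file), shows that decoding it as UTF-8 fails, and then reads it in binary mode.

```python
import os
import tempfile

# Stand-in for a binary file: the first eight bytes of any PNG image.
png_signature = b"\x89PNG\r\n\x1a\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(png_signature)
    path = tmp.name

try:
    # Reading as text fails: the byte 0x89 cannot start a UTF-8 sequence.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_read_failed = False
    except UnicodeDecodeError as err:
        text_read_failed = True
        print("text mode failed:", err)

    # Reading in binary mode returns the bytes untouched, with no decoding.
    with open(path, "rb") as f:
        data = f.read()
    print("binary mode read:", data)
finally:
    os.remove(path)
```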

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot save text which contains Chinese or Russian characters, emoji, etc. Alternatively, if you know the current encoding and the encoding you want to change it into, try a tool like iconv or GNU recode.
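If neither tool is at hand, the same conversion can be sketched in a few lines of Python. The `reencode` helper below is a made-up name for illustration, not part of any library; it does the decode-then-encode step that such converters perform, and the second half shows what happens when the target encoding cannot hold the text.

```python
def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding, then encode them into the target."""
    return data.decode(source).encode(target)

# "caf\u00e9" is "café"; in Windows-1252 the é is the single byte 0xE9.
cp1252_bytes = "caf\u00e9".encode("cp1252")
utf8_bytes = reencode(cp1252_bytes, "cp1252", "utf-8")
print(cp1252_bytes, "->", utf8_bytes)

# Not every encoding can hold every character: two Chinese characters
# (U+4E2D U+6587) have no representation in Windows-1252.
try:
    "\u4e2d\u6587".encode("cp1252")
    chinese_fits = True
except UnicodeEncodeError:
    chinese_fits = False
print("Chinese text fits in cp1252:", chinese_fits)
```

The conversion only gives correct results when the source encoding is stated correctly; guessing wrong produces mojibake rather than an error in many cases.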

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
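If you are unsure which default your system is applying, one way to find out is to ask the runtime you are using. This is a minimal Python sketch; the value printed depends entirely on the machine and locale it runs on, so no particular output should be expected.

```python
import locale
import sys

# The encoding Python uses by default for text files under the current locale.
# On a Western Windows installation this is often "cp1252".
locale_encoding = locale.getpreferredencoding(False)
print("locale default encoding:", locale_encoding)

# The encoding used when printing to the terminal, which may differ.
print("stdout encoding:", sys.stdout.encoding)
```

Quoting this value in a question is more useful than saying "ANSI", because it names the actual code page.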

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
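A short hex dump of the kind requested above can be produced with standard tools such as `xxd` or `hexdump`, or with a few lines of Python. The `hex_dump` function here is a hypothetical helper written for this illustration, and its layout only approximates the sample shown earlier.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of two-digit hex values, each prefixed by its offset."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        rows.append(f"{offset:06x} " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(rows)

# For a real file, read only the first few dozen bytes in binary mode, e.g.
#   with open("mystery.txt", "rb") as f:   # hypothetical file name
#       sample = f.read(32)
sample = bytes([0xE2, 0x89, 0xA0, 0x20, 0x61])
print(hex_dump(sample))
```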

¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

15132 questions
5 votes, 2 answers

MySQL connection character set problems

I'm using velosurf with MySQL on a Mac, without any encoding problems, but when I switch to a Linux machine, the values I get from velosurf are not encoded correctly. I found out it might be a problem with the default connection character set. On…
Tiago Fael Matos
5 votes, 2 answers

Unable to write accents (Spanish) properly in java based programs

I tried asking this in more general forums since it's not directly related to programming but I was unable to find an answer, so here I am. When I try to type accented characters (like áéíóú) using the dead key method (the usual way in Spanish…
QOI
5 votes, 1 answer

Chrome lighthouse report returns `Properly defines charset` issue in Best Practices section

I have a VueJs SPA application. Everything is working fine. But when I run chrome lighthouse report, it returns Properly defines charset error. In fact I have added charset in my index.html file. Here are screenshot of issue. Chrome light house…
Muhammad Asfund
5 votes, 1 answer

How to make IE post FORM data in UTF-8?

This is continuation of this question: Java Jersey: Receive form parameter as byte array I need form data to be posted in UTF-8 even if containing page uses ISO-8859-1 charset. I found a solution for FF, but not for IE. Here is the whole story: I…
Dima L.
5 votes, 1 answer

How do i get python to write swedish letters(åäö) into a html file?

So the code I have copied an HTML file into a string and then changed everything to lower case except normal text and comments. The problem is it also changes the åäö into something the VS code can't recognise. What I can find is its a problem with…
5 votes, 4 answers

Closing streams/sockets and try-catch-finally

Here is an example I have questions about (it comes from another SO question): public static void writeToFile (final String filename) { PrintWriter out = null; FileOutputStream fos = null; try { fos = new…
Cheetah
5 votes, 6 answers

Error in encoding mysql -> How can I reconvert it to something else?

I started a website some time ago using the wrong CHARSET in my DB and site. The HTML was set to ISO... and the DB to Latin... , the page was saved in Western latin... a big mess. The site is in French, so I created a function that replaced all…
denislexic
5 votes, 2 answers

Perl - File Encoding and Word Comparison

I have a file with one phrase/terms each line which i read to perl from STDIN. I have a list of stopwords (like "á", "são", "é") and i want to compare each one of them with each term, and remove if they are equal. The problem is that i'm not certain…
Barata
5 votes, 4 answers

How to show Hindi text in android?

I am trying to paste Hindi characters in an array with elements like String[] arr = {"आपका स्वागत है","आपका स्वागत है"}; but its giving error i.e. "some characters cannot be mapped using "Cp1252" character encoding" while saving this.
Gkapoor
5 votes, 1 answer

Reading Oracle data from SAS in a UTF-8 session - characters lose accent

SAS 9.4 M6 on a Unix server. SAS EG 8.1 client. When reading data in SAS from Oracle (10g release 10.2.0.4.0), special characters like "é", "â" are stripped from their accent so we end up with "e", "a". The result is the same whether we use libname…
FrankO
5 votes, 6 answers

Java inputStreamReader Charset

I want to ping a target IP address and receive a response. To achieve this, I'm using windows command line in Java with runtime.exec method and process class. I'm getting the response using inputStreamReader. My default charset is windows-1254, it's…
Maozturk
5 votes, 0 answers

Right to left languages in R and ggplot

I am trying to get Arabic text to display correctly in R on a Mac. Currently when i produce plots in Arabic, i have to switch to Windows. Windows correctly displays Arabic in R. However in Mac, I can't get Arabic to display right to left. I have…
Liam385
5 votes, 2 answers

Change from HTML character references to utf-8 in a bash script ie. &#257; becomes ā

How would you go about translating a document that contains the following character references to their actual readable characters in a bash script? ā á ǎ à ē é ě è ī í ǐ ì ǖ ǘ…
Roninbaka
5 votes, 2 answers

Printing Bidi text to an image

I have some code in Python using PIL, that will print UTF-8 characters to an image. I've noticed that for joining Bidi scripts like Arabic, the same code fails to connect characters correctly (the initial forms are only chosen, medial and final…
ct_
5 votes, 1 answer

What is the character encoding Postman use to write multipart file data into request

I am writing a Java application to send multipart request with attached files to an API that help me send an email with the attachments to the specified email. The API was tested with Postman and can send the email properly. The request body was as…