Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text or symbols, such as a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ followed by a no-break space in Windows code page 1252, or Б┴═ in KOI8-R, or the single character ≠ in UTF-8.
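As a minimal sketch of the point above, the following Python snippet decodes the same three bytes under each of the three encodings. The codec names `cp1252`, `koi8_r`, and `utf-8` are Python's names for these encodings; the variable names are only illustrative.

```python
# The same three bytes, interpreted under three different encodings.
data = bytes([0xE2, 0x89, 0xA0])

decoded = {enc: data.decode(enc) for enc in ("cp1252", "koi8_r", "utf-8")}

for enc, text in decoded.items():
    # ascii() escapes every non-ASCII character, so the output stays
    # unambiguous no matter how the terminal itself is configured.
    print(f"{enc:8} {len(text)} character(s): {ascii(text)}")
```

The bytes never change; only the decoded result does. Under the two single-byte encodings the bytes yield three characters, while under UTF-8 they yield one.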

A useful reference is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, and thus character encoding is not meaningful or well-defined. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary files. A common beginner problem is trying to read such a file as text and being surprised by a character encoding error; the fix in this situation is to read the file in binary mode instead.
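The difference between the two modes can be sketched in Python as follows. The file name `sample.png` and the use of the 8-byte PNG signature as stand-in binary content are assumptions made for the illustration.

```python
import os
import tempfile

# Hypothetical sample: the 8-byte PNG signature stands in for any binary file.
png_signature = b"\x89PNG\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.png")
    with open(path, "wb") as f:
        f.write(png_signature)

    # Text mode tries to decode the bytes as UTF-8 and fails on byte 0x89.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode returns the raw bytes and performs no decoding at all.
    with open(path, "rb") as f:
        raw = f.read()

print(type(text_mode_error).__name__)
print(raw)
```

Binary mode yields a `bytes` object rather than a string, so no encoding is involved and no decoding error can occur.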

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot represent text which contains Chinese or Russian characters, emoji, etc. Alternatively, if you know the current encoding and the one you want to convert to, try a tool like iconv or GNU recode.
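If a command-line tool is not available, the same conversion can be sketched in a few lines of Python. The helper names `convert_encoding` and `can_encode` and the sample file names are hypothetical; the sketch assumes you already know the source encoding, just as you would when calling `iconv -f ... -t ...`.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, from_enc, to_enc):
    """Re-encode a text file, roughly like `iconv -f from_enc -t to_enc`."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=from_enc, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=to_enc, newline="") as dst:
        dst.write(text)


def can_encode(text, encoding):
    """Report whether every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.txt")
    dst = os.path.join(tmp, "utf8.txt")
    with open(src, "wb") as f:
        f.write(b"caf\xe9\r\n")  # "café" encoded in Windows code page 1252
    convert_encoding(src, dst, "cp1252", "utf-8")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
print(can_encode("café", "cp1252"))
print(can_encode("Привет", "cp1252"))
```

Checking with something like `can_encode` before converting to a legacy encoding avoids a failure partway through writing the output file.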

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
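One way to stop guessing is to ask the system which default it actually uses. The sketch below assumes Python is available on the machine in question; `locale.getpreferredencoding` reports the locale's default text encoding, and the byte 0xE9 is just an arbitrary example of a byte whose meaning depends on the code page.

```python
import locale

# The encoding this system uses by default for text; on Windows this is
# typically the so-called "ANSI" code page of the current locale.
default_encoding = locale.getpreferredencoding(False)
print("Locale default encoding:", default_encoding)

# The same byte means different things under different Windows code pages.
byte = b"\xe9"
as_cp1252 = byte.decode("cp1252")  # Western European
as_cp1251 = byte.decode("cp1251")  # Cyrillic

# ascii() escapes non-ASCII characters so the output is unambiguous.
print("cp1252:", ascii(as_cp1252))
print("cp1251:", ascii(as_cp1251))
```

Reporting the value printed on the first line, together with the operating system, is far more useful in a question than the word "ANSI".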

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
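If no hex dump tool is at hand, a few lines of Python can produce one. This is only a sketch: the function name `hex_dump` is hypothetical, and the layout (a six-digit hexadecimal offset followed by sixteen bytes in two groups of eight) is an assumption modeled on the sample dump shown earlier.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of an offset followed by hex byte values."""
    lines = []
    half = width // 2
    for offset in range(0, len(data), width):
        chunk = [f"{b:02x}" for b in data[offset:offset + width]]
        left = " ".join(chunk[:half])
        right = " ".join(chunk[half:])
        # A double space separates the two halves of a full line.
        body = left + ("  " + right if right else "")
        lines.append(f"{offset:06x} {body}")
    return "\n".join(lines)


# The sixteen bytes from the sample dump in the "Good" example.
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
print(hex_dump(sample))
```

To dump the start of a real file, read it in binary mode first, for example `hex_dump(open(path, "rb").read(48))`, where `path` is the file you are asking about.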


¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

15132 questions
5 votes, 0 answers

Displaying custom (personally created) Unicode characters on a site

Let's say I created a shorthand writing system with its own custom set of unique characters that don't exist in Unicode. Is there a way that I could: personally draw each character, then assign each character its own code in Unicode, then…
5 votes, 1 answer

JavaScript CSV-Text download in ANSI (Windows-1252)

I try to create a CSV-File download with Javascript. We need to export Data from our Website to a 3rd party Program, the creation and download works pretty well. There is just one Problem, I need the CSV-File encoded in ANSI (Windows-1252) - the 3rd…
5 votes, 3 answers

What character encoding does ObjectOutputStream 's writeObject method use?

I read that Java uses UTF-16 encoding internally. i.e. I understand that if I have like: String var = "जनमत"; then the "जनमत" will be encoded in UTF-16 internally. So, If I dump this variable to some file such as below: fileOut = new…
5 votes, 2 answers

Read Csv file encoding error

I am using the following method for reading Csv file content: /// /// Reads data from a CSV file to a datatable /// /// Path to the CSV file /// Datatable filled with…
Germstorm
5 votes, 1 answer

Read and write Unicode character from Json file using Python

I am trying to read the below json data using attached Python code (Python V3.5.1) but the issue is that the character ç is displayed as Ã§ and £ as Â£. Please help me with the code which will correctly read and write data to and from the file,…
RintG
5 votes, 1 answer

How to open a text file with Excel in UTF-8 encoding?

I have a text file with ".tsv" extension. It has UTF-8 encoding and it contains cyrillic characters. When I try to open it with the function: "Open with"-> "Excel", Excel doesn't show the correct characters, while if I open it with Notepad++ in the…
Francesco
5 votes, 1 answer

How Angular2 Http request can return body as binary?

I have a url that return the html content with charset=iso-8859-7 which means angulars http request convert the data to utf8 by default and i am unable to encode them back in iso-8859-7 properly. After a lot of searching i found out that many people…
Rambou
5 votes, 3 answers

C++ tolower on special characters such as ü

I have trouble transforming a string to lowercase with the tolower() function in C++. With normal strings, it works as expected, however special characters are not converted successfully. How I use my function: string NotLowerCase = "Grüßen"; string…
TVA van Hesteren
5 votes, 1 answer

Java saving a file with special characters in file name

I'm having a problem on Java file encoding. I have a Java program will save a input stream as a file with a given file name, the code snippet is like: File out = new File(strFileName); Files.copy(inStream, out.toPath()); It works fine on Windows…
John.D
5 votes, 3 answers

How to convert strange strong/bold Unicode to non bold UTF-8 chars in php?

I'm trying to store a tweet in my database with twitter api, but I get this kind of strange chars which seems to be "naturals" bold chars NORMAL CHARS: azertyuio STRANGE CHARS: !! If I paste the strongs chars in my netbeans editor I get…
J. Doe
5 votes, 1 answer

Mysql data migration - wbcopytables charset

I am trying to move some data from MSSQL to MySQL. When I'm running wbcopytables.exe the charset on mysql connection seems to be wrong, I'm getting an error when the data contain emoji icons…
andy250
5 votes, 0 answers

Encoded filename Telegram bot upload file

I try to upload file using telegram bot api. File is sent but the receiver sees the encoded filename if it contains russian letters. As I found out the filename is encoded with Base64 and Utf-8 string url = "https://api.telegram.org/bot" +…
Evg
5 votes, 3 answers

Convert hex-encoded String to String

I want to convert following hex-encoded String in Swift 3: dcb04a9e103a5cd8b53763051cef09bc66abe029fdebae5e1d417e2ffc2a07a4 to its equivalant String: ܰJ:\ص7cï ¼f«à)ýë®^A~/ü*¤ Following websites do the job very…
Chanchal Raj
5 votes, 1 answer

Webserver overriding page encoding?

I have pages that I have manually coded in PHP more than 10 years ago. They are encoded in the old Hebrew encoding - windows-1255 Lately, they were all broken - text is shown as unrecognized UTF-8 characters - diamond with a question mark…
Hanan Cohen
5 votes, 1 answer

Convert data to keep accents before exporting to CSV

Using PHP, I'm exporting results of a query to CSV. My problem comes when the data contains accent; they are not exported correctly and I lose them all in the generated file. I used the utf8_decode() function to manually convert the headers and it…
Frank Parent