Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text and symbols, such as the letter a) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ (followed by a no-break space) in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.
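
As a rough illustration of the point above, the following Python sketch decodes the same three bytes under each of the three encodings. The variable names are made up for this example; the codec names cp1252, koi8_r and utf-8 are the ones Python's standard library uses.

```python
# The three bytes from the example above.
raw = bytes([0xE2, 0x89, 0xA0])

# Decode the same bytes under three different encodings.
as_cp1252 = raw.decode("cp1252")  # Windows code page 1252
as_koi8r = raw.decode("koi8_r")   # KOI8-R
as_utf8 = raw.decode("utf-8")     # UTF-8

for name, text in [("cp1252", as_cp1252), ("koi8_r", as_koi8r), ("utf-8", as_utf8)]:
    # repr() makes invisible characters, such as a no-break space, visible.
    print(f"{name:>7}: {text!r} ({len(text)} character(s))")
```

The bytes never change; only the table used to map them to characters does, which is why three bytes can come out as three characters in one encoding and a single character in another.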

A useful reference is Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

Of course, if the file you are looking at does not contain text, it does not encode any characters, and character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error. The fix in this situation is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as proprietary file formats, are binary.
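
A minimal Python sketch of that fix, assuming you only need the raw bytes. The function names read_bytes_safely and looks_like_text are hypothetical helpers invented for this example, not part of any library.

```python
def read_bytes_safely(path):
    """Read a file in binary mode. No decoding happens, so no encoding error can occur."""
    with open(path, "rb") as handle:
        return handle.read()


def looks_like_text(path, encoding="utf-8"):
    """Return True if the whole file decodes cleanly with the given encoding.

    A clean decode does not prove the file is text, but a failure strongly
    suggests it is not text in this particular encoding.
    """
    try:
        read_bytes_safely(path).decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

Opening with "rb" returns a bytes object rather than a str, so any decoding becomes an explicit step that you control.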

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Alternatively, if you know the current encoding and the one you want to convert to, try a tool like iconv or GNU recode. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot represent text which contains Chinese or Russian characters, emoji, etc.
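
If neither tool is at hand, the same conversion can be sketched in Python. This is only an illustration under the assumption that you already know the source encoding; transcode and can_encode are hypothetical names chosen for this example.

```python
def transcode(src_path, dst_path, src_encoding, dst_encoding):
    """Rewrite a text file from one known encoding into another."""
    with open(src_path, "rb") as src:
        text = src.read().decode(src_encoding)
    # Encode before opening the destination, so that a failure
    # (UnicodeEncodeError) does not leave a half-written file behind.
    data = text.encode(dst_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(data)


def can_encode(text, encoding):
    """Return True if every character in text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

For example, can_encode("中文", "cp1252") returns False, which is the limitation described above: code page 1252 has no slot for Chinese characters.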

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. A hex dump of the beginning of the file shows:

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20"

Bad: Anything which tries to use the term "ANSI" in this context.²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

On Western Windows installations, you will commonly be using code page 1252 (CP1252); but of course, if you have to guess, you need to say so, too.
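
If Python is available, one way to avoid guessing is to ask the runtime what it considers the default. This is only a sketch: it reports what Python sees in the current session, which may differ from what another program on the same machine uses. The variable name default_text_encoding is made up for this example.

```python
import locale
import sys

# The encoding Python associates with the current locale settings.
default_text_encoding = locale.getpreferredencoding(False)

print("Locale's preferred encoding:", default_text_encoding)
# The encoding used when printing to the terminal; may be None if output is redirected oddly.
print("Encoding of standard output:", sys.stdout.encoding)
```

Quoting the reported name (for example cp1252) in a question is more useful than writing "ANSI".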

Notice:

  • We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.

  • A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).

  • If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.

  • A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
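
If you do not have a hex dump tool installed, a few lines of Python can produce one in the same layout as the sample above. This is a minimal sketch; hex_dump and dump_file_head are hypothetical names, and the assumption that the leading offset column is hexadecimal is ours.

```python
def hex_dump(data, width=16):
    """Format bytes as lines of 'offset  hex bytes', with a gap after the first half."""
    lines = []
    half = width // 2
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        left = " ".join(f"{byte:02x}" for byte in chunk[:half])
        right = " ".join(f"{byte:02x}" for byte in chunk[half:])
        # Only add the double-space separator when the second half is non-empty.
        body = f"{left}  {right}" if right else left
        lines.append(f"{offset:06x} {body}")
    return lines


def dump_file_head(path, count=64):
    """Hex-dump only the first `count` bytes of a file, read in binary mode."""
    with open(path, "rb") as handle:
        return hex_dump(handle.read(count))
```

Calling print("\n".join(dump_file_head("mystery.txt"))) on a file of your own (the file name here is a placeholder) prints at most four lines, which is usually enough sample data for a question.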

Common Questions


¹ When talking about encodings, hex representations are often used because they are more concise: 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII, as ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

See Also

15132 questions
5 votes, 1 answer

Writing out results from python to csv file [UnicodeEncodeError: 'charmap' codec can't encode character

I've been trying to write a script that would potentially scrape the list of usernames off the comments section on a defined YouTube video and paste those usernames onto a .csv file. Here's the script : from selenium import webdriver import…
5 votes, 2 answers

Why it's necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?

I have the following HTML5 document:

Beträge: 20€

The output of the above code is as below: Beträge: 20€ Then I tried the below HTML5 code: …
user10318083
5 votes, 5 answers

Who can decode this code?

Here are a few samples of strange code I see in our access logs. Can anyone decode this? For example: \xb3\xe1\xdd=H\t\xd5\xd2\xf0ml\xf1\x10\xee/\xa0$\xeaY\xa5\xe7\x81d \xd5\x1f\xd9…
schuilr
5 votes, 2 answers

PHP DOMDocument nodeValue dumps literal UTF-8 characters instead of encoded

I am experiencing an issue similar to this question: nodeValue from DomDocument returning weird characters in PHP The root cause that I have found can be mimicked with mb_convert_encoding() In my unit tests, this finally caught the issue: $test =…
Dave Espionage
5 votes, 2 answers

Python convert strings of bytes to byte array

For example given an arbitrary string. Could be chars or just random bytes: string = '\xf0\x9f\xa4\xb1' I want to output: b'\xf0\x9f\xa4\xb1' This seems so simple, but I could not find an answer anywhere. Of course just typing the b followed by…
AznBoyStride
5 votes, 0 answers

read_csv() with german special character in path

I am trying to import a CSV, but my Windows username has an "ö" in it. library(tidyverse) persons <- read_csv("./data/persons.csv") This is the error message (anonymized): Error in guess_header_(datasource, tokenizer, locale) : Cannot read …
PalimPalim
5 votes, 3 answers

convert/normalize special characters when using jspdf

Trying to use the jspdf lib @1.4.1 to convert text to pdf, the output sometimes gets so ugly and unreadable, because the text contains some special characters, like: the left single quotation mark U+2018, or the right one U+2019, or symbols like →,…
Bonnard
5 votes, 4 answers

Oracle Unicode problem when using NLS_CHARACTERSET is WE8ISO8859P1 and NLS_NCHAR_CHARACTERSET is AL16UTF16, and ColdFusion as programming language

I have two Oracle 10g databases, XE and Enterprise, and these are the data types I've used in the test table. Then I tried to insert some Unicode characters from http://www.sustainablegis.com/unicode/ and the results…
tsurahman
5 votes, 1 answer

String encoding via ruby: capturing user input safely

I've searched high and low for a simple solution. None have been simple or 'just worked'. To start, I keep getting this error: ArgumentError: invalid byte sequence in US-ASCII This happens because users are copying and pasting content from…
Binary Logic
5 votes, 1 answer

cURL config file (-k / --config) JSON newlines

I'm trying to construct a cURL config file that contains newlines in the -d/--data body but it doesn't seem to work the same as on the command line. On the command line I can run: curl -XPUT 'http://localhost:9200/mytype/_search' -d '{ "query": { …
diplosaurus
5 votes, 1 answer

Content-Disposition filename in Chinese not supported

I have been trying to download an attachment with a Chinese filename, but somehow its encoding changes while downloading and a gibberish filename is saved where there are Chinese characters. Technology: Java Server: Apache Tomcat This is what I've…
5 votes, 2 answers

Python default string encoding

When, where and how does Python implicitly apply encodings to strings or does implicit transcodings (conversions)? And what are those "default" (i.e., implied) encodings? For example, what are the encodings: of string literals? s = "Byte string…
ivan_pozdeev
5 votes, 2 answers

How to navigate to URLs with \u in them?

I have come across URLs which have \u Unicode characters within them, such as the following (note that this will not map to a valid page - it is just an example). http://my_site_name.com/\u0442\uab86\u0454\uab8eR-\u0454\u043d-\u043c/23795908 How can…
aBlaze
5 votes, 2 answers

JS File upload: Detect Encoding

So, I'm trying to write a CSV-file importer using AngularJS on the frontend side and NodeJS for the backend. My problem is, that I'm not sure about the encoding of the incoming CSV files. Is there a way to automatically detect it? I first tried to…
DCH
5 votes, 1 answer

How Can I Preserve Character Entities In .Net XDocument?

I'm porting a set of services to .Net 4.0 and have discovered (much to my dismay) that character entities I'm creating and storing in XElement.Value()'s are being "restored" to their original character values when I convert the XDocument object into…
jerhewet