Questions tagged [utf-8]

UTF-8 is a character encoding that describes each Unicode code point using a byte sequence of one to four bytes. It is backwards-compatible with ASCII while still supporting representation of all Unicode code points.

UTF-8 is a character encoding that can describe the set of Unicode code points in byte sequences of one to four bytes.

UTF-8 is the most widely used character encoding, and is recommended for use on the Internet. It is the standard character encoding on Linux and other recent Unix-like operating systems. It was designed to be backwards-compatible with ASCII while still supporting representation of all Unicode code points.

The algorithm for encoding code points in UTF-8 is described in RFC 3629.
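
As a rough illustration of that encoding scheme, here is a minimal Python sketch that maps a single code point to its one-to-four-byte sequence. The function name `encode_code_point` is hypothetical and chosen only for this example; in practice Python's built-in `str.encode("utf-8")` does the same job and should be preferred.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, `encode_code_point(0x20AC)` (the euro sign) returns `b'\xe2\x82\xac'`, and any code point below 0x80 comes back as the same single byte it has in ASCII, which is where the backwards compatibility comes from.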

22178 questions
9
votes
4 answers

How to convert a UTF-8 string into Unicode?

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode. For now, my implementation is the following: public static string DecodeFromUtf8(this string utf8String) { // read the string as UTF-8 bytes. …
remio
  • 1,242
  • 2
  • 15
  • 36
9
votes
2 answers

Character encoding with Ruby 1.9.3 and the mail gem

I'm trying to parse email strings with the Ruby mail gem, and I'm having a devil of a time with character encodings. Take the following email: MIME-Version: 1.0 Sender: foobar@example.com Received: by 10.142.239.17 with HTTP; Thu, 14 Jun 2012…
Micah
  • 17,584
  • 8
  • 40
  • 46
9
votes
1 answer

perl: convert a string to utf-8 for json decode

I'm crawling a website and collecting information from its JSON. The results are saved in a hash. But some of the pages give me a "malformed UTF-8 character in JSON string" error. I notice that the last letter in "cafe" produces the error. I think it…
Ivan Wang
  • 8,306
  • 14
  • 44
  • 56
9
votes
3 answers

python regular expression with utf8 issue

I got a file which includes many lines of plain utf-8 text. Such as below, by the by, it's Chinese. PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08 The file itself was saved in…
castiel
  • 2,675
  • 5
  • 29
  • 38
9
votes
3 answers

Why can't I write Chinese characters in nodejs HTTP response?

Here is my little code: var http = require('http'); var port = 9002; var host_ip = ''; http.createServer(function (req, res) { var content = new Buffer("Hello 世界", "utf-8") console.log('request arrived'); res.writeHead(200, { …
Allan Ruin
  • 5,229
  • 7
  • 37
  • 42
9
votes
4 answers

Why is DOCTYPE line red in firefox?

The websites I've designed had no problem before but now I see DOCTYPE line red in Firefox 11. There is no problem in validation. I changed encoding to UTF-8 without BOM but problem still…
HasanG
  • 12,734
  • 29
  • 100
  • 154
8
votes
5 answers

"an integer is required" when open()'ing a file as utf-8?

I have a file I'm trying to open up in python with the following line: f = open("C:/data/lastfm-dataset-360k/test_data.tsv", "r", "utf-8") Calling this gives me the error TypeError: an integer is required I deleted all other code besides that one…
Jim
  • 4,509
  • 16
  • 50
  • 80
8
votes
3 answers

How do I convert a UTF-8 string to upper case?

Is there a portable way to convert a UTF-8 string in C to upper case? If not, what is the Linux way to do it?
August Karlstrom
  • 10,773
  • 7
  • 38
  • 60
8
votes
2 answers

Storing Chinese, Korean, English, etc in MS SQL through SQL Express

I am using MS SQL 2008 Express to connect to a shared MS SQL 2008 server where I have a database. The default collation for the DB is currently SQL_Latin1_General_CP1_CI_AS. Ultimately, I would like to store English, Korean, Chinese, and any other…
gcdev
  • 1,406
  • 3
  • 17
  • 30
8
votes
5 answers

UTF-8 problem in python when reading chars

I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it? in.txt: Stäckövérfløw code.py #!/usr/bin/env python # -*- coding: utf-8 -*- print """Content-Type: text/plain; charset="UTF-8"\n""" f = open('in.txt','r') for…
jacob
  • 1,214
  • 2
  • 13
  • 22
8
votes
2 answers

Is UTF-8 the encoding of choice for QR-codes with non ASCII chars by now?

Google uses UTF-8 as the default for their very popular encoder. From what I can see they don't even add the byte order mark. The problem is that most scanners still seem to use JIS8 (QR 2000) instead of ISO-8859 (QR 2005) as default, so it mostly…
Gonzo
  • 2,023
  • 3
  • 21
  • 30
8
votes
1 answer

kdiff3 does not show UTF-8

I am using kdiff3 with TortoiseHg. When merging a file in UTF-8 encoding, kdiff3 shows all non-Latin text like "склад". How can I fix this?
Andrew G
  • 817
  • 1
  • 9
  • 13
8
votes
2 answers

Removing invalid/incomplete multibyte characters

I'm having some issues using the following code on user input: htmlentities($string, ENT_COMPAT, 'UTF-8'); When an invalid multibyte character is detected PHP throws a notice: PHP Warning: htmlentities(): Invalid multibyte sequence in argument in…
Dean
  • 5,884
  • 2
  • 18
  • 24
8
votes
3 answers

How to parse UTF-8 representation to String in Java?

Given the following code: String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a"); String result = convertToEffectiveString(tmp); // result contain now "hello\n" Does the JDK already provide some classes for doing this ? Is there a…
Stephan
  • 41,764
  • 65
  • 238
  • 329
8
votes
3 answers

What's a good terminator byte for UTF-8 data?

I have a need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie.) To preserve this prefix-similarity as much as possible, I'd prefer to use a…
phs
  • 10,687
  • 4
  • 58
  • 84