Questions tagged [utf-8]

UTF-8 is a variable-width character encoding of the Unicode character set in which each code point is encoded as one to four bytes. Unlike some other encodings such as UTF-16, UTF-8 is backward compatible with 7-bit ASCII: ASCII characters keep their single-byte values, so UTF-8 text can be processed to some degree by applications that are only aware of bytes.
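As a rough illustration of the description above, the following Python sketch prints the UTF-8 bytes of a few characters and checks the ASCII compatibility claim. The sample characters and the helper name `utf8_hex` are chosen here for illustration only.

```python
# Sample characters (illustrative): 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["A", "\u00e1", "\u20ac", "\U0001F600"]  # A, á, €, 😀


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

# ASCII text is byte-for-byte identical whether encoded as ASCII or as UTF-8.
ascii_compatible = "plain text".encode("ascii") == "plain text".encode("utf-8")

# Every byte of a multibyte sequence has its high bit set, so byte-oriented
# tools never mistake part of such a sequence for an ASCII character.
high_bits_set = all(b >= 0x80 for b in "\u00e1\u20ac\U0001F600".encode("utf-8"))

print(ascii_compatible, high_bits_set)
```

This is why tools that only split on ASCII delimiters such as `/` or newline usually keep working on UTF-8 input, while anything that counts or indexes "characters" by byte does not.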

Full support of UTF-8 for searching, collation, word parsing, and similar tasks requires support for Unicode concepts such as characters (code points), normalisation, and supplementary characters. Many application and OS problems with "special characters", such as accented European letters or the ideographs used in Japanese and Chinese, derive from mismatched character encodings.
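The concepts named above can be sketched in a few lines of Python. The sample string is only illustrative, and the snippet shows typical symptoms rather than a diagnosis for any particular system.

```python
import unicodedata

name = "Ad\u00e1n"  # "Adán", written with an escape so the source form is unambiguous

# Mismatched encodings: UTF-8 bytes decoded as Latin-1 turn one letter into two.
mojibake = name.encode("utf-8").decode("latin-1")

# Normalisation: the same visible text can be two different code point sequences.
nfc = unicodedata.normalize("NFC", name)  # á as the single code point U+00E1
nfd = unicodedata.normalize("NFD", name)  # a followed by combining acute U+0301

# Supplementary characters (above U+FFFF) need two 16-bit units in UTF-16.
utf16_units = len("\U0001F600".encode("utf-16-le")) // 2

print(mojibake)
print(len(nfc), len(nfd), nfc == nfd)
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))
print(utf16_units)
```

Two filenames that look identical on screen can therefore differ byte for byte if one is stored in NFC and the other in NFD, which is a common source of "file not found" surprises between systems.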

104 questions
7
votes
2 answers

Linux not interpreting UTF8 encoded characters

I have the file Adán-y-Eva-50x50.jpg. When I try to access it, Apache translates the name to Ad\xc3\xa1n-y-Eva-50x50.jpg and won't find it, even though it exists. This happens only for filenames that contain UTF-8 characters. I have already…
w0rldart
  • 217
  • 1
  • 2
  • 14
7
votes
3 answers

UTF-8 and !# shell scripts

Is there a way to configure bash on Linux (Red Hat and Ubuntu) to allow shell scripts to be encoded in UTF-8? I can't find a simple way to change just one little thing and have the whole system just use UTF-8 files without having to worry about…
sal
  • 827
  • 3
  • 12
  • 18
6
votes
2 answers

Is there a difference between en_US.utf8 and en_US.UTF-8?

Server info (DNS and IPs removed): cat /proc/version && uname -a && java -version Linux version 2.6.16.33-xenU (*************) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #2 SMP Wed Aug 15 17:27:36 SAST 2007 Linux *************…
Matthew Herbst
  • 167
  • 1
  • 7
5
votes
2 answers

Mount unix samba 4 share to osx client without mangled file names

I have a Unix server (Arch Linux) with Samba 4.1.12. The share contains files whose names are encoded as UTF-8 in NFC (standard UTF-8). When I mount this Samba share on an OS X client (10.9.5), files with special names like File with "quotes" are displayed…
Markus
  • 151
  • 2
  • 10
5
votes
3 answers

How to forbid non-UTF-8 filenames?

Is it possible to enforce, at filesystem level, that all created file entries will have valid UTF-8 names? I am using Btrfs.
lvella
  • 314
  • 2
  • 13
5
votes
2 answers

Are there any SMTP servers that support the SMTPUTF8 extension?

RFC 5321 limits email addresses to 7 bit US-ASCII encoding. RFC 6531 (a fairly new spec) allows email addresses in SMTP commands and IMF headers to be encoded in UTF-8. This SMTP extension makes internationalized email addresses (e.g.,…
james.garriss
  • 360
  • 6
  • 17
5
votes
1 answer

Using UTF-8 in the /etc/passwd file. Any known issues?

I was recently asked to modify the GECOS field in the passwd file for a certain user so that it will contain his name with his original accented characters. My first reaction was "sure, why not?" but then I started getting paranoid that there…
danakim
  • 392
  • 2
  • 9
5
votes
1 answer

How to match Japanese in spamassassin?

I live in Japan. Recently there has been a lot of spam coming from China with messages written in Chinese. As spamassassin does not contain rules for Chinese, most of those emails pass with low score. I would like to identify when an email is…
lepe
  • 469
  • 2
  • 6
  • 25
5
votes
2 answers

Syntax for apache RewriteRule to match %-encoded URLs? (to fix character encoding issues; windows-1252 <=> utf-8 )

I host a webpage that has 'project²' in the URL, matching an on-disk directory project² from where static files are hosted. This page is used by a java-based client to load data from URLs (bioinformatics software IGV). My page lists URLs in the…
4
votes
1 answer

UTF-8 character decoding problems after upgrading from Windows 2008 R2 to Windows 2016 Server

My test server VM has been upgraded by corporate IT from Windows 2008 R2 to Windows 2016 Server (via 2012). I've problems with running some of my tests now and tracked the issue down to character encoding issues. The easiest thing to reproduce is…
Scrontch
  • 161
  • 4
4
votes
1 answer

Rewriting ASCII-percent-encoded locations to their UTF-8 encoded equivalent

For example, “å” can be encoded as /%E5 and /%C3%A5 (utf-8). All my filenames are UTF-8, so the ASCII variants return a 404. I want both variants to work. I have tried rewriting the incorrect URLs to the correct encodings with variations of the…
Daniel
  • 211
  • 3
  • 16
4
votes
1 answer

Broken characters in filenames only in some directories

We have a web server running CentOS 5.8 that uses SVN for version control. When trying to switch to the latest revision, we got an error about the filenames of files in an upload directory: svn: Error converting entry in directory…
Kaivosukeltaja
  • 205
  • 1
  • 8
4
votes
1 answer

problem executing a bash script with utf8 encoding

I have a bash script encoded in UTF-8. Within the script I use the sed command with § as a separator. When I execute this script, sed complains about the separator. If I use a normal character as a separator, for example @, then everything works. I have…
Inv3r53
  • 163
  • 1
  • 8
4
votes
3 answers

Set character_set_results UTF8 in MySQL my.cnf

How can I set the variable character_set_results from latin1 to utf8? I thought it would be enough to add the following variable in my.cnf: default-character-set=utf8 But it does not seem so: mysql> SHOW VARIABLES LIKE…
Marc
  • 51
  • 1
  • 1
  • 3
3
votes
2 answers

UTF-8 Characters in Apache Access Log ✔

The issue: I'm using PHP's apache_note() to log variables from web requests to a CustomLog format. However, try as I might, Apache doesn't want to log UTF-8 characters the way I'd like. In PHP, I have apache_note('some_value', '✔'); which corresponds…
Bill Huertas
  • 131
  • 1
  • 4