Questions tagged [diacritics]

A diacritic is "a mark near or through an orthographic or phonetic character or combination of characters indicating a phonetic value different from that given the unmarked or otherwise marked element" -- Merriam-Webster

From Wikipedia:

A diacritic (/daɪ.əˈkrɪtɨk/; also diacritical mark, diacritical point, diacritical sign) is a glyph added to a letter, or basic glyph. The term derives from the Greek διακριτικός (diakritikós, "distinguishing"). Diacritic is both an adjective and a noun, whereas diacritical is only an adjective. Some diacritical marks, such as the acute ( ´ ) and grave ( ` ) are often called accents. Diacritical marks may appear above or below a letter, or in some other position such as within the letter or between two letters.

The main use of diacritics in the Latin alphabet is to change the sound value of the letter to which they are added. Examples from English are the diaeresis in naïve and Noël, which show that the vowel with the diaeresis mark is pronounced separately from the preceding vowel; the acute and grave accents, which indicate that a final vowel is to be pronounced, as in saké and poetic breathèd; and the cedilla under the "c" in the borrowed French word façade, which shows it is pronounced /s/ rather than /k/. In other Latin alphabets, they may distinguish between homonyms, such as French là "there" versus la "the," which are both pronounced [la]. In Gaelic type, a dot over consonants indicates lenition of the consonant in question.

In other alphabetic systems, diacritics may perform other functions. Vowel pointing systems, namely the Arabic harakat ( ـَ, ـُ, ـِ, etc.) and the Hebrew niqqud ( ַ, ֶ, ִ, ֹ , ֻ, etc.) systems, indicate sounds (vowels and tones) that are not conveyed by the basic alphabet. The Indic virama ( ् etc.) and the Arabic sukūn ( ـْـ ) mark the absence of a vowel. Cantillation marks indicate prosody. Other uses include the Early Cyrillic titlo ( ◌҃ ) and the Hebrew gershayim ( ״ ), which, respectively, mark abbreviations or acronyms, and Greek diacritics, which showed that letters of the alphabet were being used as numerals.

In orthography and collation, a letter modified by a diacritic may be treated either as a new, distinct letter or as a letter–diacritic combination. This varies from language to language, and may vary from case to case within a language.

In some cases, letters are used as "in-line diacritics" in place of ancillary glyphs, because they modify the sound of the letter preceding them, as in the case of the "h" in English "sh" and "th".
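The "letter–diacritic combination" view described above has a rough analogue in Unicode, where many accented letters exist both as a single precomposed code point and as a base letter followed by a combining mark. The following is a minimal sketch, assuming Python 3 and only the standard-library unicodedata module; the helper name strip_diacritics is made up for illustration. It also shows that some letters, such as the dotless ı and the stroked Đ, have no decomposition and are therefore left untouched by this approach.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks (category Mn).

    Hypothetical helper for illustration; not taken from any question above.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Precomposed "é" (U+00E9) and "e" + combining acute (U+0301) are
# canonically equivalent: NFD decomposes, NFC recomposes.
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

print(strip_diacritics("naïve façade"))  # naive facade

# U+0131 (ı) and U+0110 (Đ) are distinct letters with no decomposition,
# so they pass through unchanged.
print(strip_diacritics("\u0131\u0110") == "\u0131\u0110")  # True
```

Letters that do not decompose need an explicit mapping table or a transliteration library if an ASCII result is required.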


1105 questions
22
votes
4 answers

Python and character normalization

Hello, I retrieve text-based UTF-8 data from a foreign source which contains special chars such as u"ıöüç", and I want to normalize them to English, such as "ıöüç" -> "iouc". What would be the best way to achieve this?
Hellnar
22
votes
4 answers

Remove Arabic Diacritic

I want php to convert this... Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ converted to : الحمد لله رب العالمين I am not sure where to start and how to do it. Absolutely no idea. I have done some research, found this link…
Syed Sajid
22
votes
5 answers

How can Z͎̠͗ͣḁ̵͙̑l͖͙̫̲̉̃ͦ̾͊ͬ̀g͔̤̞͓̐̓̒̽o͓̳͇̔ͥ text be prevented?

I've read about how Zalgo text works, and I'm looking to learn how a chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to: a) either be stripped, assuming…
Dan Dascalescu
21
votes
7 answers

Regex accent insensitive?

I need a Regex in a C# program. I have to capture the name of a file with a specific structure. I used the \w char class, but the problem is that this class doesn't match any accented char. How can I do this? I just don't want to put the most used…
J4N
20
votes
3 answers

ModuleNotFoundError: No module named 'unidecode' yet I have the module installed

I am trying to remove accents from a Python list of strings by converting it from UTF-8 to ASCII. I have read answers to multiple questions here in StackOverflow that suggest using the unidecode function from the unidecode package. I have installed…
Felipe Ito
19
votes
6 answers

Regex to remove non-letter characters but keep accented letters

I have strings in Spanish and other languages that may contain generic special characters like (), *, etc. that I need to remove. But the problem is that they may also contain language-specific characters like ñ, á, ó, í, etc., and they need to remain. So…
devjs11
19
votes
2 answers

python : working with german umlaut

months = ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli", "August", "September", "Oktober", "November", "Dezember"]
print months[2].decode("utf-8")
Printing months[2] fails with UnicodeDecodeError: 'utf8' codec can't decode bytes in…
deimus
19
votes
6 answers

How to handle diacritics (accents) when rewriting 'pretty URLs'

I rewrite URLs to include the title of user generated travelblogs. I do this for both readability of URLs and SEO purposes. http://www.example.com/gallery/280-Gorges_du_Todra/ The first integer is the id, the rest is for us humans (but is…
Jacco
19
votes
5 answers

Why doesn't Đ get flattened to D when Removing Accents/Diacritics

I'm using this method to remove accents from my strings:
static string RemoveAccents(string input)
{
    string normalized = input.Normalize(NormalizationForm.FormKD);
    StringBuilder builder = new StringBuilder();
    foreach (char c in…
Mladen Prajdic
19
votes
5 answers

normalizing accented characters in MySQL queries

I'd like to be able to do queries that normalize accented characters, so that for example: é, è, and ê are all treated as 'e', in queries using '=' and 'like'. I have a row with username field set to 'rené', and I'd like to be able to match on it…
George Armhold
19
votes
9 answers

ToAscii/ToUnicode in a keyboard hook destroys dead keys

It seems that if you call ToAscii() or ToUnicode() while in a global WH_KEYBOARD_LL hook, and a dead-key is pressed, it will be 'destroyed'. For example, say you've configured your input language in Windows as Spanish, and you want to type an…
00010000
18
votes
4 answers

Mongodb match accented characters as underlying character

In MongoDB "db.foo.find()" syntax, how can I tell it to match all letters and their accented versions? For example, if I have a list of names in my database: João, François, Jesús. How would I allow a search for the strings "Joao", "Francois", or…
Josh
18
votes
3 answers

Character encoding for French Accents

I'm developing my first website for a French client and I'm having massive issues with accents being displayed as "?". After googling it for days, I thought I understood, but the issues persist. To simplify it, I'll explain just the email headers (the…
denislexic
17
votes
3 answers

Should all accented characters use html entities?

I am working with a large number of HTML files that are mostly encoded as UTF-8. There are accented characters galore as many are in French. I have been converting them to HTML entities as I go, but I noticed that even in IE5.5 (according to IE tester)…
Damon
17
votes
1 answer

What's the correct algorithm to determine number of user-perceived-characters?

I have the task of counting the number of perceived characters in an input. The input is a group of ints (we can think of it as an int[]) which represents Unicode code points. java.text.BreakIterator.getCharacterInstance() is not allowed. (I mean…
Pacerier