How to convert strange strong/bold Unicode to non bold UTF-8 chars in php?

Question

I'm trying to store a tweet in my database using the Twitter API, but I get strange characters like these, which seem to be "natural" bold characters:

NORMAL CHARS:

azertyuio

STRANGE CHARS:

!!

If I paste the bold characters into my NetBeans editor, I get something like square characters...

I've never seen this before. Could you help me convert this text to non-bold characters in PHP?

What database? What is the table structure, and specifically the character set/collation you are using? This looks like a character set issue. It seems that you need to be using UTF-8 within your php client script and for storage in the field in your table. See this question: http://stackoverflow.com/questions/8274972/official-encoding-used-by-twitter-streaming-api-is-it-utf-8 — gview, Feb 15 '17 at 16:05
for example var_dump(ord('')); //return 240 var_dump(ord('s')); //return 115 — J. Doe, Feb 15 '17 at 16:05
These are unicode characters, specifically `MATHEMATICAL SANS-SERIF BOLD SMALL` from `U+1D400` to `U+1D7FF`. — Benedict Lewis, Feb 15 '17 at 16:23
OK, thanks, but how can I convert these chars to "classic" chars? So strange... why does Twitter use this kind of chars? — J. Doe, Feb 15 '17 at 16:28
Can you call `iconv` or a related library/plug-in for PHP? `$ echo | iconv -f UTF-8 -t ASCII//TRANSLIT` yields `set is ready for the discussion`. — Ken Sharp, Dec 06 '17 at 16:29

score 7 · Answer 1 · answered Jul 24 '20 at 07:23

Using the source of http://slothsoft.net/getResource.php/slothsoft/unicode-mapper as a reference, I made a function:

public function convertSpecialCharToNormalChar($text) {
    $target = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"];
    $specialList = [
        'serifBold' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '❗', '❓', '.', ',', '"', "'"],
        'serifItalic' => ['', '', '', '', '', '', '', 'ℎ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"],
        'serifBoldItalic' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '❗', '❓', '.', ',', '"', "'"],
        'sans' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '!', '?', '.', ',', '"', "'"],
        'sansBold' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '❗', '❓', '.', ',', '"', "'"],
        'sansItalic' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"],
        'sansBoldItalic' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '❗', '❓', '.', ',', '"', "'"],
        'script' => ['', '', '', '', 'ℯ', '', 'ℊ', '', '', '', '', '', '', '', 'ℴ', '', '', '', '', '', '', '', '', '', '', '', '', 'ℬ', '', '', 'ℰ', 'ℱ', '', 'ℋ', 'ℐ', '', '', 'ℒ', 'ℳ', '', '', '', '', 'ℛ', '', '', '', '', '', '', '', '', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"],
        'scriptBold' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '❗', '❓', '.', ',', '"', "'"],
        'fraktur' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'ℭ', '', '', '', '', 'ℌ', 'ℑ', '', '', '', '', '', '', '', '', 'ℜ', '', '', '', '', '', '', '', 'ℨ', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"],
        'frakturBold' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '❗', '❓', '.', ',', '"', "'"],
        'monospace' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '！', '？', '．', '，', '"', '＇'],
        'fullwidth' => ['ａ', 'ｂ', 'ｃ', 'ｄ', 'ｅ', 'ｆ', 'ｇ', 'ｈ', 'ｉ', 'ｊ', 'ｋ', 'ｌ', 'ｍ', 'ｎ', 'ｏ', 'ｐ', 'ｑ', 'ｒ', 'ｓ', 'ｔ', 'ｕ', 'ｖ', 'ｗ', 'ｘ', 'ｙ', 'ｚ', 'Ａ', 'Ｂ', 'Ｃ', 'Ｄ', 'Ｅ', 'Ｆ', 'Ｇ', 'Ｈ', 'Ｉ', 'Ｊ', 'Ｋ', 'Ｌ', 'Ｍ', 'Ｎ', 'Ｏ', 'Ｐ', 'Ｑ', 'Ｒ', 'Ｓ', 'Ｔ', 'Ｕ', 'Ｖ', 'Ｗ', 'Ｘ', 'Ｙ', 'Ｚ', '０', '１', '２', '３', '４', '５', '６', '７', '８', '９', '！', '？', '．', '，', '"', '＇'],
        'doublestruck' => ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'ℂ', '', '', '', '', 'ℍ', '', '', '', '', '', 'ℕ', '', 'ℙ', 'ℚ', 'ℝ', '', '', '', '', '', '', '', 'ℤ', '', '', '', '', '', '', '', '', '', '', '❕', '❔', '.', ',', '"', "'"],
        'capitalized' => ['ᴀ', 'ʙ', 'ᴄ', 'ᴅ', 'ᴇ', 'ꜰ', 'ɢ', 'ʜ', 'ɪ', 'ᴊ', 'ᴋ', 'ʟ', 'ᴍ', 'ɴ', 'ᴏ', 'ᴘ', 'q', 'ʀ', 'ꜱ', 'ᴛ', 'ᴜ', 'ᴠ', 'ᴡ', 'x', 'ʏ', 'ᴢ', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '﹗', '﹖', '﹒', '﹐', '"', "'"],
        'circled' => ['ⓐ', 'ⓑ', 'ⓒ', 'ⓓ', 'ⓔ', 'ⓕ', 'ⓖ', 'ⓗ', 'ⓘ', 'ⓙ', 'ⓚ', 'ⓛ', 'ⓜ', 'ⓝ', 'ⓞ', 'ⓟ', 'ⓠ', 'ⓡ', 'ⓢ', 'ⓣ', 'ⓤ', 'ⓥ', 'ⓦ', 'ⓧ', 'ⓨ', 'ⓩ', 'Ⓐ', 'Ⓑ', 'Ⓒ', 'Ⓓ', 'Ⓔ', 'Ⓕ', 'Ⓖ', 'Ⓗ', 'Ⓘ', 'Ⓙ', 'Ⓚ', 'Ⓛ', 'Ⓜ', 'Ⓝ', 'Ⓞ', 'Ⓟ', 'Ⓠ', 'Ⓡ', 'Ⓢ', 'Ⓣ', 'Ⓤ', 'Ⓥ', 'Ⓦ', 'Ⓧ', 'Ⓨ', 'Ⓩ', '⓪', '①', '②', '③', '④', '⑤', '⑥', '⑦', '⑧', '⑨', '!', '?', '.', ',', '"', "'"],
        'parenthesized' => ['⒜', '⒝', '⒞', '⒟', '⒠', '⒡', '⒢', '⒣', '⒤', '⒥', '⒦', '⒧', '⒨', '⒩', '⒪', '⒫', '⒬', '⒭', '⒮', '⒯', '⒰', '⒱', '⒲', '⒳', '⒴', '⒵', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '⓿', '⑴', '⑵', '⑶', '⑷', '⑸', '⑹', '⑺', '⑻', '⑼', '!', '?', '.', ',', '"', "'"],
        'underlinedSingle' => ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"],
        'underlinedDouble' => ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"],
        'strikethroughSingle' => ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"],
        'crosshatch' => ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '!', '?', '.', ',', '"', "'"],
    ];

    foreach ($specialList as $list) {
        $text = str_replace($list, $target, $text);
    }

    return $text;
}
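
As an alternative to a hand-written table, Unicode compatibility normalization (NFKC) folds many of these styled letters back to plain ones. The sketch below is not the method used in this answer; it assumes the intl extension is installed, and the function name foldCompatibilityForms is made up for illustration. Characters without a compatibility decomposition (the small capitals in the 'capitalized' list, or the emoji punctuation) are left as they are, and NFKC also rewrites other things, such as ligatures and superscripts, so it may change more than you want.

```php
<?php
// Hypothetical helper: fold compatibility characters (mathematical bold,
// fullwidth, circled letters, ...) to their plain forms using NFKC.
// Assumes the intl extension, which provides the Normalizer class.
function foldCompatibilityForms(string $text): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required for Normalizer.');
    }

    $normalized = Normalizer::normalize($text, Normalizer::FORM_KC);

    // normalize() returns false for invalid UTF-8; keep the input in that case.
    return $normalized === false ? $text : $normalized;
}

// Usage, guarded so the script still runs where intl is missing.
// Input: sans-serif bold "a", sans-serif bold "b", fullwidth "c", space, circled "d".
if (class_exists('Normalizer')) {
    echo foldCompatibilityForms("\u{1D5EE}\u{1D5EF}\u{FF43} \u{24D3}"), PHP_EOL;
}
```

If intl is not available, the table approach above still works without it.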

My MySQL field used the utf8 encoding, which stores at most 3 bytes per character. Some special chars, including these mathematical ones, need 4 bytes (utf8mb4). I was not able to upgrade the column due to limitations, but this function was the saviour. — Ruchit Patel, Dec 21 '20 at 09:57
I feel bad using it, but it worked :D Used it to have bold text in an input. — ALZlper, Jan 05 '21 at 15:29
I guess "the right way" would have been a div (optionally with contenteditable), but with this solution I didn't have to rebuild all my forms — ALZlper, Jan 06 '21 at 22:52

score 1 · Answer 2 · answered Mar 27 '22 at 20:35

If you don't mind using the command line: using Jacob's mapping and some Perl magic, you can convert between character sets (here for serifBold):

$ echo "𝐇𝐞𝐥𝐥𝐨" | perl -Mopen=locale -Mutf8 -pe 'y/𝐚-𝐳/a-z/' | perl -Mopen=locale -Mutf8 -pe 'y/𝐀-𝐙/A-Z/'
Hello

By piping from and to xsel -b (pbpaste and pbcopy on Mac), you can convert any text currently on the system clipboard.

Pablo Bianchi

How about numbering? :-) – Ken Sharp Feb 21 '23 at 20:49
@KenSharp I don't know which Unicode block you mean, but it would be the same: map that range to 0-9. – Pablo Bianchi Feb 22 '23 at 01:50

score 0 · Accepted Answer · answered Feb 15 '17 at 18:34

This is one of the reasons for using a Unicode encoding such as UTF-8, or HTML entities, rather than ANSI. Unicode lets you store and display characters like these (and those from other languages), handle searches when someone inputs these characters in those languages or character sets (which will only match things written in those same characters), and so on.

The alternative would be for you to write a "conversion" for every odd character set that people choose to use. Still, converting these is possible to do -- you'll just need to decide whether it is really worth your time.

The characters you submitted are called Mathematical Sans-Serif Bold characters. You can find the list at w3.org. There are also standard, slanted, and slanted bold variations of these (use the previous and next links at the top of that page).

The problem you will encounter is that, unlike switching ASCII capitals to lowercase (add 32 to the decimal value, or chr(ord(x)+32)), there is no single decimal amount that switches all characters from Mathematical Bold to their ANSI equivalents across every character group. In addition, ord() and chr() will not work for these characters, because they operate on single bytes and each of these characters takes four bytes in UTF-8; you need the multibyte equivalents, mb_ord() and mb_chr() (PHP 7.2+).

Example:

𝗮 is 120302, a is 97. 120302 - 97 = 120205
𝗔 is 120276, A is 65. 120276 - 65 = 120211
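
The numbers in this example can be checked with a short script. This is only a sketch: it assumes the mbstring extension (mb_ord() needs PHP 7.2+), and the helper name codePointOffset is invented for illustration. The characters are written as \u{...} escapes so the result does not depend on the file's encoding.

```php
<?php
// Hypothetical helper: distance between a styled character's code point
// and the code point of its plain ASCII counterpart.
function codePointOffset(string $styled, string $plain): int
{
    return mb_ord($styled, 'UTF-8') - mb_ord($plain, 'UTF-8');
}

$boldSmallA   = "\u{1D5EE}"; // MATHEMATICAL SANS-SERIF BOLD SMALL A
$boldCapitalA = "\u{1D5D4}"; // MATHEMATICAL SANS-SERIF BOLD CAPITAL A

var_dump(strlen($boldSmallA));                // int(4): four UTF-8 bytes
var_dump(ord($boldSmallA));                   // int(240): only the first byte (0xF0)
var_dump(mb_ord($boldSmallA, 'UTF-8'));       // int(120302): the whole code point
var_dump(codePointOffset($boldSmallA, 'a'));   // int(120205)
var_dump(codePointOffset($boldCapitalA, 'A')); // int(120211)
```

The 240 reported by ord() is simply the lead byte of the four-byte UTF-8 sequence, not the character's code point.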

Thus, subtracting 120205 would give you the correct lowercase a for 𝗮; however, the same would not work for 𝗔. That means you would have to determine which charset the character belongs to (Mathematical Bold, Slanted Mathematical, etc.), identify the subset it belongs to (a-z, A-Z, 0-9), then use the corresponding offset you calculated to correct it. In order to do that, you have to check every character of every tweet for characters that fit in one of your supported conversion charsets, then convert those letters.
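
A minimal sketch of that per-character check, covering only the Mathematical Sans-Serif Bold runs (capitals, small letters, digits). The constant and function names are made up for illustration, the code assumes mbstring and PHP 7.4+ (for mb_str_split()), and every other style would need its own rows in the range table.

```php
<?php
// Each row: [first styled code point, last styled code point, plain start character].
// Assumption: only Mathematical Sans-Serif Bold is supported here.
const STYLED_RANGES = [
    [0x1D5D4, 0x1D5ED, 'A'], // sans-serif bold capitals
    [0x1D5EE, 0x1D607, 'a'], // sans-serif bold small letters
    [0x1D7EC, 0x1D7F5, '0'], // sans-serif bold digits
];

function foldStyledLetters(string $text): string
{
    $out = '';

    // Walk whole UTF-8 characters; ord()/chr() alone would only see bytes.
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $codePoint = mb_ord($char, 'UTF-8');

        foreach (STYLED_RANGES as [$first, $last, $plainStart]) {
            if ($codePoint >= $first && $codePoint <= $last) {
                // Same position within the run, but starting from the plain character.
                $char = chr(ord($plainStart) + $codePoint - $first);
                break;
            }
        }

        $out .= $char;
    }

    return $out;
}

// "Hello" in sans-serif bold, followed by plain text that must pass through untouched.
echo foldStyledLetters("\u{1D5DB}\u{1D5F2}\u{1D5F9}\u{1D5F9}\u{1D5FC} 2017"), PHP_EOL;
```

Keeping the offsets in a table means each additional style is one more row rather than more code.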

That might be worth doing if there are a large number of tweets using Mathematical Bold only, but if you're importing large sets of tweets that can contain all sorts of potential characters, you're in for a lot of work.

If you think it is worthwhile, the first thing you'll need to do is look at the raw character encoding you're receiving from the API, whether it needs to be converted, then decide whether you want to map between charsets using an array of characters, use a range of values for the subsets, or some other method. You also need to decide how you'll scan for those characters.
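
One way these decisions could fit together, as a hedged sketch: scan with a regular expression for the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF), and only then build an array map from ranges and apply it with strtr(). All function names are hypothetical, mbstring is assumed (mb_chr() needs PHP 7.2+), and only the sans-serif bold runs are mapped.

```php
<?php
// Cheap scan: does the text contain anything from the Mathematical
// Alphanumeric Symbols block (U+1D400 to U+1D7FF)?
function hasMathAlphanumerics(string $text): bool
{
    return preg_match('/[\x{1D400}-\x{1D7FF}]/u', $text) === 1;
}

// Build a [styled => plain] lookup for one contiguous run of code points.
function buildRunMap(int $firstCodePoint, string $plainStart, int $length): array
{
    $map = [];
    for ($i = 0; $i < $length; $i++) {
        $map[mb_chr($firstCodePoint + $i, 'UTF-8')] = chr(ord($plainStart) + $i);
    }
    return $map;
}

function stripSansBold(string $text): string
{
    if (!hasMathAlphanumerics($text)) {
        return $text; // nothing to convert, so skip building the table
    }

    // Assumption: only the sans-serif bold runs are of interest.
    $map = buildRunMap(0x1D5D4, 'A', 26)
         + buildRunMap(0x1D5EE, 'a', 26)
         + buildRunMap(0x1D7EC, '0', 10);

    return strtr($text, $map);
}

echo stripSansBold("\u{1D600}\u{1D5F2}\u{1D601} is ready"), PHP_EOL;
```

The scan keeps ordinary tweets on a fast path, and strtr() with an array does all replacements in a single pass over the text.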

All in all, the answer to your question is that it is possible to convert them, but your situation and particulars are going to determine whether it is worthwhile and how you accomplish it. It's not something that can be written for you.

Wow! Big thanks for this reply :) Now I understand ;) I will see if I can find a function on the web for this issue (but I doubt it...). I'll keep you informed ;) thanks — J. Doe, Feb 20 '17 at 13:35
And for info tweets with this kind of chars are made by an extension not natively by twitter — J. Doe, Feb 20 '17 at 13:36
@J.Doe FYI. The problem you are facing could be similarly described as trying to convert Emoji to words. Rather than Emoji, you are attempting to handle characters. In either case, what would be required is the same -- you would need to know every Emoji for every type of phone, and every corresponding word to replace it with. Same goes for various charsets and the intended characters they should be replaced by. Edit: I say the same, because from the perspective of the computer they are the same -- simply unicode characters. — Jacob S, Feb 20 '17 at 18:48
