Select unicode character u2028 in mysql 5.1

Question

I am trying to select unicdode character /u2028 in MySQL 5.1. MySQL 5.1 does support utf8 and ucs2.

In newer versions of MySQL i could select the char just be using utf16 or utf32 collation:

SELECT char(0x2028 using utf16);
SELECT char(0x00002028 using utf32);

But MySQL 5.1 do not support utf16 and utf32. How could I select the unicode character then?

Perhaps a few words about my use case: I have an third party application which stores data in a mysql database and using JavaScript for user interface. The application do not deal with the problem unicode characters /u2028 and /u2029 are valid JSON but will break JavaScript code. (For details see http://timelessrepo.com/json-isnt-a-javascript-subset) So I like to know how much data is affected by that issue and perhaps use replace on MySQL to fix it.

To demonstrate the problem:

CREATE TABLE IF NOT EXISTS `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(100) CHARACTER SET utf8 NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;

INSERT INTO `test` (`id`, `string`) VALUES
(1, 'without U+2028'),
(2, 'with U+2028 at this "     "point');

SELECT * FROM test WHERE string LIKE CONCAT("%", char(0x2028 using utf16), "%");
// returns row 2 as expected

SELECT * FROM test WHERE string LIKE CONCAT("%", char(??? using utf8), "%");
// U+2028 in utf8 is 0xE2 0x80 0xA8 isn't it?
// But how to parse this to char function?

Not quite sure what the problem is. U+2028 is a character perfectly supported by the `utf8` encoding in MySQL. However, the `CHAR` function and its arguments is somewhat of a different topic. Why exactly to you want to convert the literal `0x2028` to the UTF-8 encoded U+2028 Unicode character? Is this *really* the problem you're trying to solve? — deceze, Jan 20 '14 at 12:21
U+2028 LINE SEPARATOR is rarely used. Are you sure you need to deal with it? It may cause problems in JavaScript, but how does this relate to MySQL? — Jukka K. Korpela, Jan 20 '14 at 12:34
@JukkaK.Korpela As I wrote in my use case I have to deal with U+2028 and U+2029 just because of problems they cause in JavaScript. I need to remove these characters of a MySQL database of a third party application because of JavaScript problems in the frontend. — jelhan, Jan 31 '14 at 13:07
@deceze: I added a demonstration of the problem to the question. In the end I like to write a find and replace query for all U+2028 and U+2029 characters. — jelhan, Jan 31 '14 at 13:22
I have more or less the same problem with 2028. Would be interested in any solution. — Pistos, May 14 '14 at 22:58

Tim Rogers · Accepted Answer · 2020-05-01T21:01:10.640

11

The unicode character U+2028 can be encoded in UTF-8 as hexadecimal e280a8. So the answer is to use the UNHEX function in MySQL to find it.

SELECT * FROM test WHERE string LIKE CONCAT("%", UNHEX('e280a8'), "%");

MySQL 5.1 can only handle characters encloded in UTF-8 up to three bytes long. So searching for U+2028 using UNHEX will work, but searching for U+1F600 won't as that takes up four bytes.

Use UNHEX('e280a9') to search for U+2029.

To find other characters, visit https://fileformat.info/info/unicode/char/2028/index.htm, substituting '2028' for the character you are looking for. Look for the the number in brackets in the 'UTF-8 (hex)' row.

edited May 01 '20 at 21:01

answered Feb 08 '15 at 22:02

Tim Rogers

426
5
14

2

Where do you get / calculate the mapping between U+2028 and e280a8 ? I have other characters I need to work on. – Robert Mar 29 '17 at 22:57
@Robert I've updated the answer to explain how to get the mappings for other characters. – Tim Rogers May 01 '20 at 21:04

Select unicode character u2028 in mysql 5.1

1 Answers1