9

I am having trouble searching for arabic text in mysql. I have in database a row containing a record

display_name
أحمد

But when I try to do a query with

SELECT * FROM wp_users WHERE display_name LIKE '%احمد%'

I tried to add at the end of the query

collate utf8_bin

But it didn't work either. How can I have

احمد == أحمد
Funk Forty Niner
  • 74,450
  • 15
  • 68
  • 141
Ahmed
  • 636
  • 1
  • 14
  • 31
  • 1
    Just before your SELECT put `$con->set_charset("utf8");` --- `$con` is a DB connection variable. Change it to the one you're using. That could/should work. – Funk Forty Niner Apr 24 '14 at 14:57
  • I am using wordpress. $wpdb->get_results() – Ahmed Apr 24 '14 at 14:58
  • What is your DB connection variable? I use for example `$con = new mysqli(MYSQL_SERVER, MYSQL_USER, MYSQL_PASSWORD, MYSQL_DB);` – Funk Forty Niner Apr 24 '14 at 14:59
  • In wordpress you don't have the connection variable. You access the database through the wordpress database api. – Ahmed Apr 24 '14 at 15:00
  • Ok, I know next to nothing about Wordpress. I added the relevant tag to your question. – Funk Forty Niner Apr 24 '14 at 15:01
  • Is there anyway I can set charset in heidisql when doing select? – Ahmed Apr 24 '14 at 15:02
  • I don't know about heidisql. There surely must be some form of DB connection variable somewhere. That's what you need to find. – Funk Forty Niner Apr 24 '14 at 15:09
  • 1
    I think there is no way else storing those names normalized first. Or before search, you should set a list of similar characters such as `اأإآ` and remove them from the beginning of the string then use like. However, this is a limited solution for only letters at the beginning or at the end of the string. – SaidbakR Apr 24 '14 at 23:28
  • 1
    @Ahmed You have to get look at http://www.ar-php.org/ – SaidbakR Apr 24 '14 at 23:45

1 Answers1

11

I don't have an exact solution, but I can tell you why it's not working. If you want those two strings to be considered equal, you need to use a different collation since utf8_bin compares exact code points, and those two strings are clearly not identical when considered that way. Normally MySQL's utf8_general_ci collation would provide transliteration and normalisation, for example all these match:

SELECT 'a'='A' COLLATE utf8_general_ci;
SELECT 'ü'='u' COLLATE utf8_general_ci;
SELECT 'ß'='ss' COLLATE utf8_general_ci;

but in your case it doesn't work, and nor does the more accurate utf8_unicode_ci collation:

SELECT 'احمد'='أحمد' COLLATE utf8_general_ci;
SELECT 'احمد'='أحمد' COLLATE utf8_unicode_ci;

This chart shows the character mapping for middle eastern languages in MySQL's utf8_unicode_ci collation, and you can see that the أ and ا characters are not considered equal, so MySQL's default collations won't solve this problem.

To work around this you have two options: normalise your strings before they hit MySQL (i.e. in PHP), or extend MySQL to provide an appropriate collation to do what you need.

The Ar-PHP project can help with the former, as sємsєм suggested. You should store your real user name and the normalised one separately so that you can search on one and display the other. Another project also provides a way of rephrasing arabic strings to work better in MySQL.

MySQL docs show how to create a custom collation. It essentially involves editing an LDML XML file (there is at least a BBEdit plugin to help with this) and giving it to MySQL. This will allow you to create a mapping that lets you treat some characters as equivalents. The advantage of this approach is that it's transparent to PHP and you don't need any additional columns in your database. If you build such a mapping, it would be beneficial to other Arabic users across multiple programming languages, not just PHP.

Synchro
  • 35,538
  • 15
  • 81
  • 104
  • 3
    I have created a custom collation utf8_arabic, if anyone else is running in the same problem https://gist.github.com/ahmednasir91/0cf805b5843b295e8959 – Ahmed May 01 '14 at 13:15