Searching a big mysql database with relevance

Question

I'm building a rather large "search" engine for our company intranet, it has 1miljon plus entries it's running on a rather fast server and yet it takes up to 1 min for some search queries.

This is how the table looks

Table

I tried create an index for it, but it seems as if i'm missing something, this is how the show index is showing

idexes

and this is the query itself, it is the ordering that slows the query mostly but even a query without the sorting is somewhat slow.

SELECT SQL_CALC_FOUND_ROWS *
FROM `businessunit`
INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id` = `businessunit-postaddress`.`BusinessUnit`
WHERE `businessunit`.`Name` LIKE 'tanto%'
ORDER BY `businessunit`.`Premium` DESC ,
CASE WHEN `businessunit`.`Name` = 'tanto'
THEN 0
WHEN `businessunit`.`Name` LIKE 'tanto %'
THEN 1
WHEN `businessunit`.`Name` LIKE 'tanto%'
THEN 2
ELSE 3
END , `businessunit`.`Name`
LIMIT 0 , 30

any help is very much appreciated

Edit: What's choking this query 99% is ordering by relevance with the wildcharacter % When i Do an explain it says using where; using fsort

Change query: remove some part of the query and run again (trial and error), you can find out by yourself which part is slowing down the result. — , Jan 28 '13 at 11:46
yes of course, but the problem remains, how should i keep ordering by relevancy and yet keep it fast — Breezer, Jan 28 '13 at 11:49
did you found the part that slows down? for example: remove all order by, then test, if its slow, then, its not order by, remove where clause and etc, you should identify the exact part — , Jan 28 '13 at 11:54
Describe the type of searches that your search system is meant to handle; it's quite likely that trying to implement it in SQL is the wrong way to go, as proper search solutions will always be more flexible and faster; see @minesh's answer below. — El Yobo, Jan 28 '13 at 11:57
Have you tried implementing a FULL_TEXT index? It will return results by relevance and is far quicker than 'LIKE ... %' which will not make use of indexes in all situations. http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html — thaJeztah, Feb 03 '13 at 16:51
Can you show us the structure of the other table it's joining on please? — EM-Creations, Feb 08 '13 at 15:09
"I'm building a rather large "search" engine" Why? There are search engines available to use for free - e.g. http://lucene.apache.org/core/ which will provide all the search functionality you need. Trying to write one in PHP + SQL is going to be inherently slow compared to those. — Danack, Feb 09 '13 at 02:59

score 19 · Accepted Answer · answered Jan 28 '13 at 11:52

19

You should try sphinx search solution which is full-text search engine will give you very good performance along with lots of options to set relevancy.

Click here for more details.

answered Jan 28 '13 at 11:52

Minesh

2,284
1
14
22

4

+1 for the point, fast and advanced search isn't easily done in SQL. I've personally had a great deal of success with [Solr](http://lucene.apache.org/solr/) which has a [good library for PHP (Solr PHP Client)](https://code.google.com/p/solr-php-client/). – El Yobo Jan 28 '13 at 11:56
Taking a look into it, seems intresting =) – Breezer Jan 28 '13 at 11:57
Depends on the type of search that you need to match; if you just need to match a lot of exact values, as suggested by your data, then SQL should perform well. – El Yobo Jan 28 '13 at 11:58
@ElYobo Yeah its just a match on a max 255 characher string, the regular company names that's around the world, as soon as I removed from ´case´ to ´end´ the query performed 1000 times faster, so its the ordering with relecancy that's choking it – Breezer Jan 28 '13 at 12:02
I ended up using this, it worked closest to what i wanted to accomplish – Breezer Feb 04 '13 at 11:11
Yeah, there is, I used this when I first worked with Solr (late 2009). Unfortunately I ended up having to deploy on Windows though, and there were no binaries available (at the time, this may have changed), so I switched to the Solr PHP Client library at the last minute. It turned out to be useful, as there were some extensions to the library ([Solr Cell support](https://code.google.com/p/solr-php-client/source/detail?r=34) was one) that I was able to write, and wouldn't have been able to in C! – El Yobo Feb 07 '13 at 20:28
But is the drawback of stopwords is applicable to this full-text search by sphinx search? Usually some common words results do not get retrieved by full-text search, example the word 'one' is a popular stopword and results containing this word are not retrieved. – Javier Brooklyn Feb 09 '13 at 05:40
@Breezer It seems to me that the question isn't related to full-text search. Please read my answer. – knagaev Feb 10 '13 at 13:20
I really liked this Sphinxsolution Since it alot easier to deploy on a windows enviroment, but I'm still trying to figure out things, since english isnt my main tongue, takes some time to get through the documentation to get started, any aid on that part is greatly appreciated aswell. You're welcome to join me here http://chat.stackoverflow.com/rooms/24391/mysql-sphinx, so this topic doesnt get anymore crowded =) – Breezer Feb 12 '13 at 16:54

score 6 · Answer 2 · answered Jan 28 '13 at 11:52

Seems like the index doesn't cover Premium, yet that is the first ORDER BY argument.

Use EXPLAIN your query here to figure out the query plan and change your index to remove any table scans as explained in http://dev.mysql.com/doc/refman/5.0/en/using-explain.html

score 6 · Answer 3 · answered Feb 07 '13 at 15:41

MySQL is good for storing data but not great when it comes down to fast text based search.

Apart from Sphinx which has been already suggested I recommend two fantastic search engines:

Solr with http://pecl.php.net/package/solr - very popular search engine. Used on massive services like NetFlix.
Elastic Search - relatively new software but with very active community and lots of respect

Both solution are based on the same library Apache Lucene

Can you suggest how can use with PHP +mysqli? – Manmeet Khurana Jun 12 '18 at 17:46 — Manmeet Khurana, Jun 12 '18 at 17:46

score 2 · Answer 4 · edited Feb 07 '13 at 14:26

2

Most of the search engine oriended sites are use FULL-TEXT-SEARCH. It will be very faster compare to select and LIKE... I have added one example and some links ... I think it will be useful for you... In this full text search have some conditions also...

STEP:1

CREATE TABLE articles (
    id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    title VARCHAR(200),
    body TEXT,
    FULLTEXT (title,body)
);

STEP:2

INSERT INTO articles (title,body) VALUES
    ('MySQL Tutorial','DBMS stands for DataBase ...'),
    ('How To Use MySQL Well','After you went through a ...'),
    ('Optimizing MySQL','In this tutorial we will show ...'),
    ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
    ('MySQL vs. YourSQL','In the following database comparison ...'),
    ('MySQL Security','When configured properly, MySQL ...');

STEP:3
Natural Language Full-Text Searches:

SELECT * FROM articles
    WHERE MATCH (title,body) AGAINST ('database');

Boolean Full-Text Searches

SELECT * FROM articles WHERE MATCH (title,body)
     AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);

Go through this links viralpatel.net,devzone.zend.com,sqlmag.com,colorado.edu,en.wikipedia.org

edited Feb 07 '13 at 14:26

rynhe

2,509
1
21
27

answered Feb 04 '13 at 07:35

Malaiyandi Murugan

393
3
15

thank u....... if u have any doubt let me know..... i used this one for big database...... – Malaiyandi Murugan Feb 04 '13 at 10:01
I got it running pretty well, but I still get "Using where,Using filesort" when i run this query `SELECT SQL_CALC_FOUND_ROWS *, MATCH(`businessunit`.`Name`) AGAINST('städ') AS score FROM businessunit INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id`=`businessunit-postaddress`.`BusinessUnit` WHERE MATCH(`Name`) AGAINST('städ') ORDER BY `businessunit`.`Premium` DESC, score DESC LIMIT 60,30` so the ordering is still affecting the performance – Breezer Feb 04 '13 at 10:38
hi breezer pls add index for selecting datas.......then u can check with query....... – Malaiyandi Murugan Feb 04 '13 at 10:44
http://www.c-sharpcorner.com/UploadFile/krishnasarala/creating-full-text-index-in-sql-server/ refer this – Malaiyandi Murugan Feb 04 '13 at 10:52
this is the index http://img526.imageshack.us/img526/7168/indexes.png and this is the explain result http://img824.imageshack.us/img824/8253/explainy.png – Breezer Feb 04 '13 at 10:52
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/23890/discussion-between-breezer-and-malai) – Breezer Feb 04 '13 at 10:55
if u using without join na it will be fast – Malaiyandi Murugan Feb 04 '13 at 11:27

score 2 · Answer 5 · answered Feb 05 '13 at 23:39

If the "ORDER BY" is really the bottleneck, the straight-forward solution would be to remove the "ORDER BY" logic from your query, and re-implement the sorting directly in your application's code using C# sorting. Unfortunately, this means you'd also have to move your pagination into your application, since you'd need to obtain the complete result set before you can sort & paginate it. I'm just mentioning this because no-one else so far appears to have thought of it.

Frankly (like others have pointed out), the query you showed at the top should not need full-text indexing. A single suffix wildcard (e.g., LIKE 'ABC%') should be very effective as long as a BTREE (and not a HASH) index is available on the column in question.

And, personally, I have no aversion to even double-wildcard (e.g., LIKE '%ABC%"), which of course can never make use of indexes, as long as a full table scan is cheap. Probably 250,000 rows is the point where I'll start to seriously consider full-text indexing. 100,000 is definitely no problem.

I always make sure my SELECT's are dirty-reads, though (no transactionality applied to the select).

It's dirty once it gets to the user's eyeballs in any case!

knagaev · Answer 6 · 2013-02-09T16:28:07.653

It's so strange query :) Let's try to understand what it does.

The results are less than 30 rows from the table "businessunit" with some conditions.

The first condition is a foreign key of the "businessunit-postaddress" table.
Please check if you have an index on the column businessunit-postaddress.BusinessUnit.

The second one is a filter for returning rows only with businessunit.Name begining with 'tanto'.
If I didn't make a mistake you have a very complex index 'Business' consists of 11 fields!
And field 'Name' is not the first field in this index.
So this index is useless when you run "like tanto%"'s query.
I have strong doubt about necessity of this index at all.
By the way it demands quite big resources for its maintaining and slow down edit operations with this table.
You have to make an index with the only field 'Name'.

After filtering the query is sorting results and do it in some strange way too.
At first it sorts by field businessunit.Premium - it's normal.
However next statements with CASE are useless too.
That's why.
The zero are assigned to Name = 'tanto' (exactly).
The next rows with the one are rows with space after 'tanto' - these will be after 'tanto' in any cases (except special symbols) cause space is lower than any letter.
The next rows with the two are rows with some letters after 'tanto' (include space!). These rows will be in this order too by definition.
And the three is "reserved" for "other" rows but you won't get "other" rows - remeber about [WHERE businessunit.Name LIKE 'tanto%'] condition.
So this part of ORDER BY is meaningless.
And at the end of ORDER BY there is businessunit.Name again...

My advice: you need rebuild the query from scratch keeping in mind what you want to get.

Anyway I guess you can use

SELECT SQL_CALC_FOUND_ROWS *
FROM `businessunit`
INNER JOIN `businessunit-postaddress` ON `businessunit`.`Id` = `businessunit-postaddress`.`BusinessUnit`
WHERE `businessunit`.`Name` LIKE 'tanto%'
ORDER BY `businessunit`.`Premium` DESC,
`businessunit`.`Name`
LIMIT 0 , 30

Don't forget about an index on field businessunit-postaddress.BusinessUnit!

And I have strong assumption about field Premium. I guess it is designed for storing binary data (yes/no). So an ordinary (BTREE) index doesn't match. You have to use bitmap index.

P.S. I'm not sure that you really need to use SQL_CALC_FOUND_ROWS MySQL: Pagination - SQL_CALC_FOUND_ROWS vs COUNT()-Query

2 thinks that comes in mind pretty instant is that there is no bitman indexing in mysql, well not as far as i know, and the search itself is still quite slow compared to fulltext searching, anyways by the tests I've made, but that Count vs SQL_Calc part seems intresting, thx m8 — Breezer, Feb 12 '13 at 16:51
@Breezer Did you try my query instead of the original one? Of course, check existence of an index on businessunit-postaddress.BusinessUnit. If you'd like we can discuss my post by e-mail. It seems to me that the problem is much simpler than you think. And you don't need fulltext search in this case. — knagaev, Feb 13 '13 at 11:07

Javier Brooklyn · Answer 7 · 2013-02-16T05:24:01.020

Its either full-text(http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) or the pattern matching (http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html) from php and mysql side.

From experience and theory:

Advantages of full-text -
1) Results are very relevant and de-limit characters like spacing in the search query does not hinder the search.
Disadvantages of full-text -
1) There are stopwords that are used as restrictions by webhosters to prevent excess load of data.(E.g. search results containing the word 'one' or 'moz' are not displayed. And this can be avoided if you're running your own server by keeping no stopwords.
2) If I type 'ree' it only displays words containing exactly 'ree' not 'three' or 'reed'.

Advantages of pattern matching -
1) It does not have any stopwords as in full-text and if you search for 'ree', it displays any word containing 'ree' like 'reed' or 'three' unlike fulltext where only the exact word is retreived.
Disadvantages of pattern matching-
1) If delimiters like spaces are used in your search words and if these spaces are not there in the results, because each word is separate from any delimiters, then it returns no result.

fthiella · Answer 8 · 2013-02-04T11:34:09.347

If the argument of LIKE doesn't begin with a wildchard character, like in your example, LIKE operator should be able to take advantage of indexes.

In this case, LIKE operator should perform better than LOCATE or LEFT, so I suspect that changing the condition like this could make things worse, but I still think it's worth trying (who knows?):

WHERE LOCATE('tanto', `businessunit`.`Name`)=1

or:

WHERE LEFT(`businessunit`.`Name`,5)='tanto'

I would also change your order by clause:

ORDER BY
  `businessunit`.`Premium` DESC ,
   CASE WHEN `businessunit`.`Name` LIKE 'tanto %' THEN 1
        WHEN `businessunit`.`Name` = 'tanto'      THEN 0
        ELSE 2 END,
   `businessunit`.`Name`

Name has to be LIKE 'tanto%' already, so you can skip a condition (CASE will never return value 3). Of course, make sure that Premium field is indexed.

Hope this helps.

score 0 · Answer 9 · answered Feb 07 '13 at 06:08

I think you need to collect the keys only, sort them, then join last

SELECT A.*,B.* FROM
(
    SELECT * FROM (
        SELECT id BusinessUnit,Premium
            CASE
                WHEN Name = 'tanto'      THEN 0
                WHEN Name LIKE 'tanto %' THEN 1
                WHEN Name LIKE 'tanto%'  THEN 2
                ELSE 3
            END SortOrder
        FROM businessunit Name LIKE 'tanto%'
    ) AA ORDER BY Premium,SortOrder LIMIT 0,30
) A LEFT JOIN `businessunit-postaddress` B USING (BusinessUnit);

This will still generate a filesort.

You may want to consider preloading the needed keys in a separate table you can index.

CREATE TABLE BusinessKeys
(
    id int not null auto_increment,
    BusinessUnit int not null,
    Premium      int not null,
    SortOrder    int not null,
    PRIMARY KEY (id),
    KEY OrderIndex (Premuim,SortOrder,BusinessUnit)
);

Populate all keys that match

INSERT INTO BusinessKeys (BusinessUnit,Premuim,SortOrder)
SELECT id,Premium
    CASE
        WHEN Name = 'tanto'      THEN 0
        WHEN Name LIKE 'tanto %' THEN 1
        WHEN Name LIKE 'tanto%'  THEN 2
        ELSE 3
    END
FROM businessunit Name LIKE 'tanto%';

Then, to paginate, run LIMIT on the BusinessKeys only

SELECT A.*,B.*
FROM
    (
        SELECT FROM BusinessKeys
        ORDER BY Premium,SortOrder
        LIMIT 0,30
    ) BK
    LEFT JOIN businessunit A ON BK.BusinessUnit = A.id
    LEFT JOIN `businessunit-postaddress` B ON A.BusinessUnit = B.BusinessUnit
;

CAVEAT : I use LEFT JOIN instead of INNER JOIN because LEFT JOIN preserves the order of the keys from the left side of the query.

I can't get that first query running, but I like your approach, if only i could get it running so I could see how it performance — Breezer, Feb 07 '13 at 08:44

score 0 · Answer 10 · answered Feb 08 '13 at 17:39

I've read the answer to use Sphinx to optimize the search. But regarding my experience I would advise a different solution. We used Sphinx for some years and had a few nasty problems with segmentation faults and corrupted indice. Perhaps Sphinx isn't as buggy as a few years before, but for a year now we are very happy with a different solution:

http://www.elasticsearch.org/

The great benefits:

Scalability - you can simply add another server with nearly zero configuration. If you know mysql replication, you'll love this feature
Speed - Even under heavy load you get good results in much less than a second
Easy to learn - Only by knowing HTTP and JSON you can use it. If you are a Web-Developer, you feel like home
Easy to install - it is useable without touching the configuration. You just need simple Java (no Tomcat or whatever) and a Firewall to block direct access from the public
Good Javascript integration - even a phpMyAdmin-like Tool is a simple HTML-Page using Javascript: https://github.com/mobz/elasticsearch-head
Good PHP Integration with https://github.com/ruflin/Elastica
Good community support
Good documentation (it is not eye friendly, but it covers nearly every function!)

If you need an additional storage solution, you can easily combine the search engine with http://couchdb.apache.org/

Searching a big mysql database with relevance

10 Answers10