2

I have a big list of domain names, and a big list of words. I want to check how many domain names in my list have each of these words at the end of them. I tried 2 queries but both are taking too much time to execute. Trying to find out if there is any way to make the query faster.

First I tried this query, it is taking around 50 minutes to return results:

SELECT COUNT(*) AS count 
FROM   table 
WHERE  domain_name LIKE '%my_word.%'; 

Then I thought maybe if I remove the .% from it, maybe it would go faster so I tried this, but it is still slow:

SELECT COUNT(*) AS count 
FROM   table 
WHERE  SUBSTRING_INDEX(domain_name, '.', 1) LIKE '%my_word'; 

Anyone has an idea of a query that might go faster?

Thanks.

CodeBird
  • 3,883
  • 2
  • 20
  • 35

4 Answers4

0

You need double SUBSTRING_INDEX and SUBSTRING from the end for a word match.
The following would count example.com as well as the example.de and somethingexample.com

SELECT COUNT(*)
  FROM t1
 WHERE SUBSTRING(
        SUBSTRING_INDEX(
         SUBSTRING_INDEX(domain_name, '.', -2),
          '.',1), length('example')*(-1)) = 'example';

SQLFiddle

Also, you can avoid typing pattern 2 times, using variables (Fiddle):

SET @WORD = 'example';
SET @LNGTH = length(@WORD)*(-1);
SELECT COUNT(*)
  FROM t1
 WHERE SUBSTRING(
        SUBSTRING_INDEX(
         SUBSTRING_INDEX(domain_name, '.', -2),
          '.',1), @LNGTH) = @WORD;
Alexander
  • 3,129
  • 2
  • 19
  • 33
  • I don't want subdomains I want domains, codeexample.com should count if I am searching for the word `example` – CodeBird Feb 21 '14 at 09:20
  • yes, but to find the count of domains that have `example` at the beggining like `examplecode.com` is very fast, like it takes few seconds, the issue is the end of the word – CodeBird Feb 21 '14 at 09:21
  • this won't work, for sure. and you're still missing a `,` between '.' and the 1. thanks for the help. – CodeBird Feb 21 '14 at 09:29
  • trying it, will let you know – CodeBird Feb 21 '14 at 09:59
  • down to 27 mins at the place of 50 using this: SELECT COUNT(*) FROM table WHERE RIGHT(SUBSTRING_INDEX(domain_name, '.',1), length('men')) = 'men'; – CodeBird Feb 21 '14 at 11:21
  • no, ran another took 50 min again, I think it is the same thing :( – CodeBird Feb 21 '14 at 12:20
  • @CodeBird how is that even possible, lol? Probably, counting length 1 time will save some resources. I updated second query (with variables), can you try it? – Alexander Feb 21 '14 at 12:26
0

For best results, you need to create FULLTEXT index on your data, use innoDb mysql engine

eg

SELECT  *
FROM    domain_name
WHERE   MATCH(data) AGAINST ('+word1 +word2 +word3' IN BOOLEAN MODE)

refer http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html

Dimag Kharab
  • 4,439
  • 1
  • 24
  • 45
  • will this work with no spaces? And even if it does, then it would get me back results where the word is in the middle of the domain like if I am search for `example` it would get me `codeexampletest.com` as result, which I don't want. Thanks for trying – CodeBird Feb 21 '14 at 09:26
  • #1214 - The used table type doesn't support FULLTEXT indexes :( – CodeBird Feb 21 '14 at 09:37
  • innoDb mysql engine supports , but you are using is myIsam – Dimag Kharab Feb 21 '14 at 09:38
0

Your problem can be solved by fulltext index but it is compatible only in version 5.6 for innodb and you can use in any version for myisam tables.

Create Full Text Index by below command-

ALTER TABLE my_table ADD FULLTEXT(domain_name);

Now you can use query as per below-

SELECT count(*) FROM my_table WHERE MATCH (domain_name) AGAINST ('your_word');
Zafar Malik
  • 6,734
  • 2
  • 19
  • 30
0

Since your query is not looking for separate words, I believe it would't benefit from a full text index (the engine won't know how to parse the domain into individual words). I also don't believe substring solutions will yield better performance than the LIKE operator. You are dealing with simple string matching and databases are not the best place to run it.

That said, I believe the best you can do to improve the speed for this query is to reduce the number of bytes that need to be read from disk. You can achieve it by creating a simple index on this column (the query will still be an index scan, but it will be cheaper than the table scan). Another thing you can do is to remove unnecessary characters from this column, like .com or .net, since it will allow for more meaningful data to be fetched in every access to disk.

Gerardo Lima
  • 6,467
  • 3
  • 31
  • 47
  • thanks for the suggestion but I can't remove the .com and .net etc... the database is for a running website that is actually a whois website, whoisology.com, so can't really remove info – CodeBird Feb 21 '14 at 11:20
  • Hi, @CodeBird, I am sorry, I wasn't clear, I didn't mean to remove the `.com` or `.net` from the existing column, but to create a new column, optimized for this purpose, without the bytes that don't matter. – Gerardo Lima Feb 21 '14 at 11:45