0

I'm creating a 'smart' search engine that will look into database by relevancy. My system calculate how many words in your sentence correspond to the database field 'tag_clean' that contains text, and try to get the proper result (one per research).

For example you get 'search youpla boom' in a tag_clean field, and another entry like 'search youpla bim' if you tape 'search bim' it will show the second entry.

My system set one point per word and get the most relevant as result. Everything works but my big problem is, it ignores completely the words order !

If you have 'google image test' and 'google test' and you search 'google test image' with my system, the most relevant will be the first one, but it's the second one that's right.

I'd like a system that understand the importance of word orders, but i've no idea how to do it in SQL.

A sample of my SQL request (important part is CASE WHEN at the end):

SELECT * 
FROM keywords 
WHERE tag_clean LIKE 'google%' 
    AND (static = 0) 
    AND 
    (
        tag_clean LIKE '%google%' 
        OR tag_clean LIKE '%test%' 
        OR tag_clean LIKE '%image%'
    )
    OR 
    (
        tag_clean = 'google test image' 
        AND static = 1
    ) 
ORDER BY 
    ((CASE WHEN tag_clean LIKE '%google%' THEN 1 ELSE 0 END) 
        + (CASE WHEN tag_clean LIKE '%test%' THEN 1 ELSE 0 END) 
        + (CASE WHEN tag_clean LIKE '%image%' THEN 1 ELSE 0 END)) 
DESC LIMIT 0, 1;

Thank you people :)

Community
  • 1
  • 1

1 Answers1

1

First, I'm not sure that raw SQL is the best tool for this. You should look into the full text capabilities of whatever engine you are using. Searching for text is a rather solved problem, and databases support this functionality (through extensions to the base language).

Assuming you want to continue, the problem is your structure. You could start to add additional clauses for tag clear like '%google test%' and every other two-way combination. That might be a quick and dirty solution.

Your real problem is that you are storing relational data in a single field. There should be a keywords table with a separate row for each keyword on each document. This would have columns such as: documentID, KeyWord, and KeyWordPosition. Using the KeyWordPosition, you could start to do the proximity searches that you want.

However, you are probably better off investigating full text functionality in existing software.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • You're right, and i already thought about this 'dirty' solution, but i want something clean :) Your system can be nice, I will think about it, thank you man :) – Nothing personal Sep 14 '12 at 16:09