2

I'm new to this really useful Q&A website, and my English is not very good, so sorry about that.

I'm interested in a web project that I think is not hard to build: a simplified way of surfing the web.

Algorithm description 1
Algorithm description 2

I'm sure this algorithm is very simple, because it quickly analyzes web content and finds the relevant information.

Can someone tell me how this algorithm works, so that I can try to make something similar?

On what principles does this algorithm work?

THANKS!

ffriend
Miki Cloud

2 Answers

1

I just answered a very similar question. In your particular case it makes sense to create a topic list manually, train a machine learning classifier on some examples for each topic, and then, at search time, assign each search result to one of the topics. That way you get search results grouped by topic.

UPD. OK, here are step-by-step instructions for one possible approach.

First of all, take a look at my recent post about document similarity computation. Then do the following:

  1. Implement a procedure for computing the similarity between two texts (as described in my post), or find an existing one.
  2. Create several collections of documents, one for every category (topic) you want to use (food, IT, politics, medicine, etc.).
  3. Compute a common vector over all documents in each collection.
  4. When the user performs a search, compute a vector for every result you find.
  5. Assign every result to the category whose common vector is most similar to it.
  6. Group the results by their computed category.
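
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the method from the linked post: it assumes plain bag-of-words term counts as the document vector, cosine similarity as the similarity measure, and the sum of a collection's term counts as its "common vector". The function names and the toy collections are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into a bag-of-words term-count vector (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def common_vector(documents):
    """Step 3: sum the term counts of all documents in one collection."""
    total = Counter()
    for doc in documents:
        total.update(vectorize(doc))
    return total


def classify(text, category_vectors):
    """Steps 4-5: pick the category whose common vector is most similar."""
    vec = vectorize(text)
    return max(category_vectors, key=lambda name: cosine(vec, category_vectors[name]))


def group_results(results, category_vectors):
    """Step 6: group search results by their computed category."""
    groups = defaultdict(list)
    for result in results:
        groups[classify(result, category_vectors)].append(result)
    return dict(groups)


# Step 2: hypothetical toy collections, one per topic.
collections = {
    "food": ["fresh pasta recipe with tomato sauce",
             "how to bake bread and cook soup"],
    "IT": ["python programming tutorial for web servers",
           "install linux and configure the database server"],
}
category_vectors = {name: common_vector(docs) for name, docs in collections.items()}

# Hypothetical search results to be grouped.
results = ["easy tomato soup recipe", "database tutorial for python"]
grouped = group_results(results, category_vectors)
```

In practice you would probably want to remove stop words and weight terms (for example with TF-IDF) before comparing vectors, since raw counts let very frequent words dominate the similarity.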
Community
ffriend
  • OK, thanks, this is useful, but can you tell me something more about it? – Miki Cloud Jan 08 '12 at 20:39
  • Can maui-indexer help in my case? – Miki Cloud Jan 08 '12 at 20:43
  • please look at my algorithm: http://stackoverflow.com/questions/8781545/my-algorithm-analyse – Miki Cloud Jan 08 '12 at 21:34
  • @MikiCloud: no, `similar_text` measures the difference between strings (sequences of chars), not texts (sequences of words). In your scheme, are you trying to beat Google's results? Also, the 20 most frequent common words are not necessarily topic words. – ffriend Jan 08 '12 at 23:40
  • Not necessarily, but when I remove typical words like "a", "the", "and", ... I will get the words that describe the topic. What do you think? How can I get the words that describe a topic? What do you suggest? – Miki Cloud Jan 09 '12 at 00:05
  • And I am not trying to beat Google; I am trying to get more out of Google's results. For example, when you type "where is Paris", the result should be text giving the location of Paris, not links to sites that you must open to see the information. I posted a link above to the Summly iPhone app; please take a look. – Miki Cloud Jan 09 '12 at 00:07
  • Google finds the best sites for a keyword; I want to find the most relevant information on the sites in Google's results :) ! Do you want to help me with this project? – Miki Cloud Jan 09 '12 at 00:08
  • @MikiCloud: please do not post lots of comments; they are not intended to be a chat. Check out the SO rules. Concerning your questions: finding topic words is called automatic summarization or keyword extraction. Search the web or SO for more info. The Summly app does exactly what I described in my answer: it classifies results into different topics. Concerning the Paris example, you are talking about a very complex general-purpose Q&A system, which involves automatic knowledge extraction and answer generation. This field is in active development, so you can participate in research groups. – ffriend Jan 09 '12 at 01:25
  • Can you tell me about some research groups? – Miki Cloud Jan 09 '12 at 01:33
-1

NLP, to me, is a program that looks at raw text and labels it.

I look at it that way because I want to use it as a trainer (self-supervision) for a GA that grunts into words, as long as you record what the user says to it in a Markov chain, so you can use as much processor power as you want to accelerate mutation.

Note: I haven't done it yet, but I think the idea is cool, it's hackerific, and it seems like it would work.