I created a very basic search option for my blog, and it generates results by topic and keyword. In certain articles, though, I add links to external websites, for example when I refer to someone else's blog for more information. I would like my search to go through those links and find results from the linked pages as well. Is this possible? I don't want to use GCSE. Thanks in advance; it would be a great help.

Thanks again.

SSH
  • Could you please edit your Q so that it doesn't read as 1 long run-on sentence? Also, would you please clarify what the acronym GCSE is? I haven't seen that acronym before. – Clomp Apr 15 '16 at 18:56

1 Answer

Yes, it is possible to write a bot that crawls external websites by following links. I've made one, and it crawled 100K+ website URLs. You can build one that follows the links from your blog.

To create a search engine, you'll need to know some internals regarding how they work...

Search Bots work like this:

  1. Crawler fetches pages. This step is pretty easy, as it uses curl.
  2. Parser splits the HTML into pieces, so that data can be extracted from the page. This has 2 sub-components to it, which...

    a. Extracts any data from the page that you want to capture & then saves that data into a database.

    b. Extracts links & places them back into the crawling queue. This creates an infinite loop, so your bot never stops crawling... (Unless someone else's malformed URL crashes it, which happens a lot. So be ready to frequently fix it.)

  3. Indexer creates lookup indexes, which map keywords to the web page's contents. This has 2 sub-components to it, as it...

    a. Creates a Forward Index, which maps each document to keywords that are inside of that document.

    doc1 | bird, aviary, robin, dove, blue jay, cardinal
    doc2 | birds, bird watching, binoculars
    doc3 | cats, eat, birds
    doc4 | cats, generally, don't, like, water, nor, neighborhood, dogs
    doc5 | dog, shows, look, fun
    

    b. Creates an Inverted Index from the Forward Index, which reverses the mapping. This allows users to search by keyword; the search script then looks up the keyword & suggests the documents that users may want to view. Like so...

    bird | doc1, doc2, doc3
    cat  | doc3, doc4
    dog  | doc4, doc5
    
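Steps 1 and 2 above can be sketched in Python roughly as follows. This is a minimal sketch, not the bot described in this answer: it uses the standard library's `urllib` in place of curl, and the names `LinkExtractor`, `extract_links`, `fetch`, `crawl` and the `max_pages` cap are assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Step 2b: return the absolute http(s) links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.links:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def fetch(url):
    """Step 1: download a page (urllib stands in for curl here)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl; returns {url: html} for the pages fetched."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
            links = extract_links(html, url)
        except (OSError, ValueError):
            continue  # skip unreachable or malformed URLs instead of crashing
        pages[url] = html  # step 2a would save the extracted data here
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)  # step 2b: back into the crawling queue
    return pages
```

The `seen` set stops the same URL from being queued twice, and `max_pages` keeps the otherwise endless loop bounded. For a blog search, you may prefer to queue only the links found in your own posts rather than following every link on the external pages too.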

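Step 3 can be sketched the same way. The snippet below builds a Forward Index from the five example documents and then inverts it. It is only an illustration: the function names and the very naive plural stripping in `normalize` are assumptions, and a real indexer would use a proper stemmer and a database instead of dictionaries.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def normalize(word):
    """Naive stemming (an assumption): strip a trailing 's' so 'birds' matches 'bird'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_forward_index(documents):
    """Map each document id to its distinct keywords, in first-seen order."""
    forward = {}
    for doc_id, text in documents.items():
        keywords = (normalize(word) for word in tokenize(text))
        forward[doc_id] = list(dict.fromkeys(keywords))
    return forward


def build_inverted_index(forward):
    """Reverse the Forward Index: map each keyword to the documents containing it."""
    inverted = defaultdict(list)
    for doc_id, keywords in forward.items():
        for keyword in keywords:
            inverted[keyword].append(doc_id)
    return dict(inverted)


documents = {
    "doc1": "bird aviary robin dove blue jay cardinal",
    "doc2": "birds bird watching binoculars",
    "doc3": "cats eat birds",
    "doc4": "cats generally don't like water nor neighborhood dogs",
    "doc5": "dog shows look fun",
}

forward_index = build_forward_index(documents)
inverted_index = build_inverted_index(forward_index)
print(inverted_index["bird"])
print(inverted_index["cat"])
print(inverted_index["dog"])
```

With this normalization, doc3 is also listed under "bird", because it contains the word "birds".
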
Search Forms work like this:

  1. Search Form shows the HTML input box to the user.
  2. Search Script will search the Inverted Index to find which document links to display in the Search Engine Results Page.
  3. Search Engine Results Page (yes, SERP is an actual industry acronym for Search Engine Results Page). This displays the list of search result links. You can style it any way that you'd like & it doesn't have to look like Google's, Microsoft's Bing or Yahoo's engines.

Examples:

Searching for:

"bird" returns links to "doc1, doc2, doc3"
"cat"  returns links to "doc3, doc4"
"dog"  returns links to "doc4, doc5"

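A Search Script that does these lookups could look roughly like the sketch below. The hard-coded `INVERTED_INDEX`, the `search` function and the ranking by number of matching words are all assumptions for illustration; in practice the index would come from your database.

```python
from collections import Counter

# Inverted Index for the five example documents (hypothetical in-memory layout).
INVERTED_INDEX = {
    "bird": ["doc1", "doc2", "doc3"],  # doc3 contains "birds"
    "cat": ["doc3", "doc4"],
    "dog": ["doc4", "doc5"],
}


def normalize(word):
    """Lowercase, trim punctuation, and naively strip a plural 's'."""
    word = word.lower().strip(".,!?\"'")
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def search(query, index=INVERTED_INDEX):
    """Return the ids of documents matching any query word, best match first."""
    hits = Counter()
    for word in query.split():
        for doc_id in index.get(normalize(word), []):
            hits[doc_id] += 1
    # Most matching words first; ties are broken by document id.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _count in ranked]


print(search("bird"))
print(search("cats birds"))
```

The Search Engine Results Page would then render one link per returned document id. A query with no known keywords simply returns an empty list.
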
Good luck building your search engine for your blog!

Clomp
  • Thanks a lot for your answer. I will work on it and will definitely come back with more doubts :) – SSH Apr 16 '16 at 21:19