
I am trying to make a WebCrawler which finds links on a homepage and visits the found links again and again. I have now written code with a parser that shows me the found links and prints statistics for some tags of that page, but I don't get how to visit the new links in a loop and print their statistics too.


require 'net/http'

@visit = {}
@src = Net::HTTP.start(@url.host, @url.port) do |http|
  http.get(@url.path)
end
@content = @src.body


def govisit
  if @content =~ @commentTag
    # comment handling is not implemented yet
  end

  cnt = @content.scan(@aTag)
  cnt.each do |link|
    @visit[link] = []
  end

  puts "Links on this site:"
  @visit.each_key do |link|
    puts link
  end

  if @visit.size >= 500
    exit 0
  end

  printStatistics
end
Faculty
1 Answer


First of all you need a function that accepts a link and returns the body output. Then parse all the links out of the body and keep them in a list. Check that list against the links you have already visited, remove the visited ones, then call the same function on the remaining new links and repeat.

To stop the crawler at a certain point you need to build a stop condition into the while loop.
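For example (a hypothetical variant, not part of the code below): instead of a page count you could cap the crawl by elapsed time, reusing the get_body and get_links helpers defined below. The 60-second limit is an arbitrary choice:

start_time = Time.now

# crawl for at most 60 seconds, or until no unvisited links remain
while Time.now - start_time < 60 && !@new_links.empty?
  body = get_body(@new_links.shift)
  get_links(body)
end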

Based on your code:

require 'net/http'
require 'uri'

@visited_links = []
@new_links = []

def get_body(link)
  @visited_links << link
  uri = URI(link)
  # fetch the page the given link points to
  src = Net::HTTP.start(uri.host, uri.port) { |http| http.get(uri.path.empty? ? "/" : uri.path) }
  src.body
end

def get_links(body)
  # parse the links out of the body
  # and add only those you have not visited yet to @new_links
end

start_link_body = get_body("http://www.test.com")
get_links(start_link_body)

# stop after 500 visited pages, or earlier when no unvisited links remain
while @visited_links.size < 500 && !@new_links.empty?
  body = get_body(@new_links.shift)
  get_links(body)
end
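get_links is left as a stub above; here is one minimal sketch of how it could be filled in with a plain regular expression, so no extra gems are needed. The href pattern is an assumption and deliberately naive: it only catches absolute http(s) URLs inside double-quoted href attributes. For anything serious, an HTML parser such as Nokogiri is the sturdier choice:

def get_links(body)
  # naive extraction: absolute http(s) URLs inside double-quoted href attributes
  body.scan(/href="(https?:\/\/[^"]+)"/).flatten.each do |link|
    # queue each link only once, and only if it has not been visited yet
    next if @visited_links.include?(link) || @new_links.include?(link)
    @new_links << link
  end
end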
Vince V.