
First of all - many thanks in advance. I really appreciate it all.

  1. I need to crawl a small number of URLs fairly constantly (around every hour) and pull specific data from them

  2. A PHP site will be updated with the crawled data; I can't change that

I've read this solution: Best solution to host a crawler? It seems fine and has the upside of using cloud services if you want to scale up.

I'm also aware of the existence of Scrapy.

Now, I wonder if there is a more complete solution to this without me having to set all these things up myself. It doesn't seem like a very unusual problem, and I'd like to save time by using a more complete solution or a set of instructions. I would contact the person in this thread (https://stackoverflow.com/users/2335675/marcus-lind) to get more specific help, but I can't.

I'm currently running Windows on my personal machine, and trying to mess with Scrapy there is not the easiest thing, with installation problems and the like.

Do you think there is no way of avoiding this work? If there isn't, how do I know whether I should go with Python/Scrapy or Ruby on Rails, for example?


1 Answer


If the data you're trying to get is reasonably well structured, you could use a third-party service like Kimono or import.io.

I find setting up a basic crawler in Python incredibly easy. After looking at a lot of options, including Scrapy (it didn't play well with my Windows machine either, due to the nightmare dependencies), I settled on using Selenium's Python package driven by PhantomJS for headless browsing.

Defining your crawling function would probably only take a handful of lines of code. This is a little rudimentary, but if you wanted to keep it super simple as a straight Python script, you could do something like the following and just let it run while some condition is true or until you kill the script.

from selenium import webdriver
import time

# PhantomJS acts as a headless browser; the window size just needs to be
# large enough that pages render their full desktop layout
crawler = webdriver.PhantomJS()
crawler.set_window_size(1024, 768)

def crawl():
    crawler.get('http://www.url.com/')
    # Find your elements, get the contents, parse them using Selenium or BeautifulSoup

# Repeat roughly every hour until the script is killed
while True:
    crawl()
    time.sleep(3600)
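To flesh out the comment inside crawl(), here's a rough sketch of what the element-finding and parsing step might look like. The URL, the 'h2.title' selector, and the link-printing are all made-up examples; you'd swap in whatever elements your target pages actually use, and you could just as easily stay entirely in Selenium without BeautifulSoup.

from bs4 import BeautifulSoup
from selenium import webdriver

crawler = webdriver.PhantomJS()
crawler.set_window_size(1024, 768)

def crawl():
    crawler.get('http://www.url.com/')
    # Option 1: let Selenium locate the elements directly (selector is hypothetical)
    for title in crawler.find_elements_by_css_selector('h2.title'):
        print(title.text)
    # Option 2: hand the rendered HTML to BeautifulSoup and parse it there
    soup = BeautifulSoup(crawler.page_source, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))

From there you'd push whatever you extract to your PHP site in whatever format it expects.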
  • Many thanks! Well, at least as far as this solution goes, it would be pretty slow to cover a lot of URLs (though maybe it would be fast enough; I haven't checked). I've come across a service called import.io and I'm investigating it now. I'll post my conclusion. Thanks again – eddr Nov 18 '14 at 21:11