I'm creating a spider with Scrapy, and I want to use a MySQL database to get the start_urls for my spider. Is it possible to connect Scrapy Cloud to a remote database?
Can I run a spider on Scrapinghub with a remote database to get start_urls? – gueyebaba Jul 20 '15 at 14:57
1 Answer
You can do that by overriding the start_requests spider method:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
You can basically do anything you want from there.
MySQL-python is installed by default on Scrapy Cloud. Docs: http://mysql-python.sourceforge.net/

José Ricardo
Now I override start_requests and pass my IP address as the host, for example con = mdb.connect(host='192.168.1.2', user='root', passwd='admin', db='scrapinghub'). When I deploy the spider on Scrapinghub I get this error: Can't connect to MySQL server on '192.168.1.26' – gueyebaba Jul 24 '15 at 15:18
Hi, this isn't your public IP, this is your local network address. To find your public IP, visit http://httpbin.org/ip from the machine that's hosting the MySQL server. – José Ricardo Jul 25 '15 at 13:36
With my public IP address provided by http://httpbin.org/ip, I get the same error. – gueyebaba Jul 28 '15 at 11:48
Are you sure the server is listening for connections from non-local machines? – José Ricardo Jul 30 '15 at 15:13
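One common reason a MySQL server refuses non-local clients is that it is bound to localhost only. A sketch of the relevant server setting, assuming a Debian-style config file path (the location varies by distribution and MySQL version):

```ini
# /etc/mysql/my.cnf (path varies by distribution)
[mysqld]
# Many installs default to 127.0.0.1, which rejects remote clients.
# 0.0.0.0 listens on all interfaces; restart mysqld after changing it.
bind-address = 0.0.0.0
```

The MySQL user must also be granted access from remote hosts (a user defined as `'user'@'localhost'` cannot log in from elsewhere; one defined with a `'%'` host wildcard can), and any firewall in between must allow TCP port 3306.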
You are having trouble with your network setup. To access the Internet from the server where you are hosting your database, you are using NAT (https://en.wikipedia.org/wiki/Network_address_translation), probably behind a commodity firewall. You need to configure that firewall to forward this traffic to the database server, which is well beyond the original scope of this question. – ftrotter Aug 28 '16 at 17:35
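Before digging into MySQL grants at all, it can help to check whether the database port is reachable from outside the local network. A small stand-alone sketch using only the standard library (the host and port are whatever you are testing; 3306 is MySQL's default):

```python
import socket


def can_reach(host, port=3306, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds.

    This only proves the port is open through NAT/firewalls; it says
    nothing about MySQL credentials or grants.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run this from a machine outside your network (or from a Scrapy Cloud job's logs) against your public IP: if it returns False, the problem is port forwarding or the firewall, not MySQL itself.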