
I am new to Scrapy and Scrapyd. I did some reading and developed a crawler that crawls a news website and gives me all the news articles from it. If I run the crawler simply with

scrapy crawl project name -o something.txt

it gives me all the scraped data in something.txt correctly.

Now I tried deploying my scrapy crawler project on localhost:6800 using scrapyd.

I then scheduled the crawler using

curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz_spider

It gives me this on the command line:

{"status": "ok", "jobid": "545dfcf092de11e3ad8b0013d43164b8"}

which I think is correct, and I am even able to see my crawler as a job in the UI view of localhost:6800.

But where do I find the data scraped by my crawler, which I previously collected in something.txt?

Please help....

This is my crawler code:

import urlparse

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from tutorial.items import DmozItem  # assumed location of DmozItem in the project's items.py


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["timesofindia.com"]
    start_urls = ["http://mobiletoi.timesofindia.com/htmldbtoi/TOIPU/20140206/TOIPU_articles__20140206.html"]

    def parse(self, response):
        sel = Selector(response)

        # Yield one item per section title on the index page.
        for ti in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(title=ti)

        # Yield the section links and follow each one.
        for url in sel.xpath("//a[@class='pda']/@href").extract():
            itemLink = urlparse.urljoin(response.url, url)
            yield DmozItem(link=url)
            yield Request(itemLink, callback=self.my_parse)

    def my_parse(self, response):
        sel = Selector(response)
        self.log('A response from my_parse just arrived!')
        for head in sel.xpath("//b[@class='pda']/text()").extract():
            yield DmozItem(heading=head)
        for text in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(desc=text)
        for url_desc in sel.xpath("//a[@class='pda']/@href").extract():
            itemLinkDesc = urlparse.urljoin(response.url, url_desc)
            yield DmozItem(link=url_desc)
            yield Request(itemLinkDesc, callback=self.my_parse_desc)

    def my_parse_desc(self, response):
        sel = Selector(response)
        self.log('ENTERED ITERATION OF MY_PARSE_DESC!')
        for bo in sel.xpath("//font[@class='pda']/text()").extract():
            yield DmozItem(body=bo)
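
For reference, a minimal sketch of what the `DmozItem` used above would presumably look like in the project's `items.py`, inferred from the fields the spider populates (it is not shown in the original post, so the actual definition may differ):

from scrapy.item import Item, Field

class DmozItem(Item):
    # Fields inferred from the spider above; hypothetical, not from the original post.
    title = Field()
    link = Field()
    heading = Field()
    desc = Field()
    body = Field()
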
Yogesh D
  • Check `/var/log/scrapyd/`. – Blender Feb 11 '14 at 05:45
  • Thanks, got the output in the `f980130e92e711e3ad8b0013d43164b8.log` file inside `/var/log/scrapyd/`. – Yogesh D Feb 11 '14 at 06:48
  • @Blender But as per the [scrapyd tutorial](http://scrapyd.readthedocs.org/en/latest/install.html) I should get any standard output in `/var/log/scrapyd/scrapyd.out`, but I am not getting anything in that file. – Yogesh D Feb 11 '14 at 07:31
  • @Blender Though I am getting the output in the logs, I actually need my output in a separate JSON file, as I have further data extraction and processing to do on it server-side. – Yogesh D Feb 13 '14 at 05:24
  • Look in `/etc/scrapyd/scrapyd.conf` and see what `items_dir` is set to. – Blender Feb 13 '14 at 07:18
  • @Blender The path is set to `/var/lib/scrapyd/items`. I get your point that if I change this path I can get my output file where I want, but the output file I am getting has a `.jl` extension and its name is the job id of the crawl job; instead I want my own file name and a JSON extension. – Yogesh D Feb 13 '14 at 07:36
  • Then subclass some of Scrapyd's modules and do just that. It's not versatile. – Blender Feb 13 '14 at 07:45
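
For reference, the `items_dir` discussed in the comments above lives in Scrapyd's configuration file; a minimal sketch of the relevant part of `/etc/scrapyd/scrapyd.conf` (the value shown is the one reported in the comments, and Scrapyd writes the item feeds below it, named after the job id, which is why the output turns up as a `.jl` file):

[scrapyd]
# Items are stored under this directory as <project>/<spider>/<job id>.jl
items_dir = /var/lib/scrapyd/items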

1 Answer


When using the feed exports you define where to store the feed using a URI (through the FEED_URI setting). The feed exports support multiple storage backend types, which are defined by the URI scheme.

curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz_spider -d setting=FEED_URI=file:///path/to/output.json
kev
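
As an alternative to passing the setting on every `schedule.json` call, the feed export can be configured once in the project's `settings.py`; a minimal sketch, with an illustrative output path (`%(name)s` and `%(time)s` are placeholders Scrapy fills in per run):

# settings.py (sketch) -- the output path is illustrative, not from the original post
FEED_URI = 'file:///home/yogesh/crawled_data/%(name)s_%(time)s.json'
FEED_FORMAT = 'json'  # 'json' writes a single JSON array; the default 'jsonlines' writes one object per line

Setting `FEED_FORMAT` to `json` also addresses the formatting issue raised in the comments below, where the default JSON Lines output looks like malformed JSON.
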
  • What should the path look like? I mean, something like `file:///home/yogesh/to/output.json`? – Yogesh D Feb 11 '14 at 07:06
  • Sorry, but I am missing something or doing something wrong. I am giving this at the command line: `curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz -d setting=FEED_URI=file:///home/yogesh/output.json`. With this my crawler runs, but I get this error in the log file: `exceptions.IOError: [Errno 13] Permission denied: '/home/yogesh/output.json'` – Yogesh D Feb 13 '14 at 04:59
  • @y.dixit You should give write permission to other users: `chmod 777 /path/to/`. – kev Feb 13 '14 at 05:49
  • Done. I first did `chmod 777 /home/crawled_data` and then the JSON API command posted in the answer, and it worked successfully. Thanks kev. – Yogesh D Feb 13 '14 at 06:59
  • Earlier, when I ran `scrapy crawl [spider_name] -o [something].json -t json`, I got the output as a well-formatted JSON file, but if I drop the `-t json` part of the command, the file has a `.json` extension yet is not JSON inside. The same thing is happening now: I am getting a file with a JSON extension, but it is not well formatted. For example, `{"title": "Front Page"}{"title": "Times City"} {"title": "Times Nation"}` is not well-formatted JSON compared to `[{"body": "WORLD RAP "}]`. – Yogesh D Feb 13 '14 at 07:48
  • @y.dixit It's the JSON Lines (`jsonlines`) format. It works well when the output is large; you can read it line by line. – kev Feb 14 '14 at 07:58
  • I wanted it to be in normal JSON format, as I have a parser that works well with normal JSON files; the other format adds some overhead when I parse it. So let's see, I will work with it. – Yogesh D Feb 14 '14 at 11:12
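
A minimal sketch of reading the JSON Lines output line by line, as suggested in the comment above (the path is illustrative; Scrapyd names the item feed after the job id returned by `schedule.json`):

import json

# Illustrative path under Scrapyd's items_dir; adjust to your project, spider and job id.
path = '/var/lib/scrapyd/items/tutorial/dmoz/545dfcf092de11e3ad8b0013d43164b8.jl'

items = []
with open(path) as f:
    for line in f:
        line = line.strip()
        if line:
            items.append(json.loads(line))  # one JSON object per line

print len(items)  # Python 2, to match the spider code above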