
I am running a batch of 500 crawl jobs on scrapyd, fired from a shell script. I am hitting this issue both locally on my Mac and on an EC2 instance. The crawl jobs work fine in batches of 100, but when I run 500 I get a "sqlite3.OperationalError: unable to open database file" exception after roughly 300 crawls.
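
For context, the driver script is essentially a loop like the sketch below (projects.txt and the one-spider-per-project naming are simplified stand-ins for my actual setup):

    #!/usr/bin/env bash
    # Deploy and schedule one scrapyd project per crawl.
    # Assumes each project directory contains a scrapy.cfg pointing at the
    # scrapyd target, and projects.txt lists one project name per line.
    while read -r project; do
        # Push the project's egg to scrapyd.
        (cd "$project" && scrapyd-deploy)
        # Kick off its spider through scrapyd's HTTP API.
        curl -s http://localhost:6800/schedule.json \
             -d project="$project" -d spider="$project"
    done < projects.txt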

Note: each crawl (one spider) is its own project and is deployed to scrapyd, which means a full run ends up with 500 projects deployed.

After about 300 crawls are done I start seeing this exception and cannot deploy any more projects. If I then restart the scrapyd server, it will not start again; it throws the same exception on startup.

The only way I can get it to start and crawl again is by doing the following (script version after the list):

  1. stopping the server
  2. rm -rf the dbs files
  3. rm -rf eggs (probably not required)
  4. rm -rf logs (probably not required)
  5. start server
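
In script form, the recovery is roughly this (assuming scrapyd was started from ~/scrapyENV/bin, which is where the dbs, eggs and logs directories live, judging by the lsof output further down):

    # 1. stop the server
    kill "$(pgrep -f scrapyd)"
    cd ~/scrapyENV/bin
    # 2. remove the per-project sqlite queue files
    rm -rf dbs/*
    # 3. and 4. probably not required, but I clear these too
    rm -rf eggs/* logs/*
    # 5. start the server again
    scrapyd &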

Any ideas why this would happen? Here is the exception:

    2017-04-13T23:28:57+0000 [stdout#info] 1
    2017-04-13T23:28:57+0000 [stdout#info] Traceback (most recent call last):
    2017-04-13T23:28:57+0000 [stdout#info]   File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
    2017-04-13T23:28:57+0000 [stdout#info]   File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/lib/python2.7/site-packages/scrapyd/runner.py", line 39, in <module>
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/lib/python2.7/site-packages/scrapyd/runner.py", line 34, in main
    2017-04-13T23:28:57+0000 [stdout#info]   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/lib/python2.7/site-packages/scrapyd/runner.py", line 13, in project_environment
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/__init__.py", line 14, in get_application
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/app.py", line 37, in application
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/website.py", line 35, in __init__
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/website.py", line 38, in update_projects
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/poller.py", line 30, in update_projects
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/utils.py", line 61, in get_spider_queues
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/spiderqueue.py", line 12, in __init__
    2017-04-13T23:28:57+0000 [stdout#info]   File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/sqlite.py", line 98, in __init__
    2017-04-13T23:28:57+0000 [stdout#info] sqlite3.OperationalError: unable to open database file
    2017-04-13T23:28:57+0000 [_GenericHTTPChannelProtocol,673,10.0.3.119] Unhandled Error
            Traceback (most recent call last):
              File "/home/ec2-user/scrapyENV/local/lib64/python2.7/site-packages/twisted/web/http.py", line 1845, in allContentReceived
                req.requestReceived(command, path, version)
              File "/home/ec2-user/scrapyENV/local/lib64/python2.7/site-packages/twisted/web/http.py", line 766, in requestReceived
                self.process()
              File "/home/ec2-user/scrapyENV/local/lib64/python2.7/site-packages/twisted/web/server.py", line 190, in process
                self.render(resrc)
              File "/home/ec2-user/scrapyENV/local/lib64/python2.7/site-packages/twisted/web/server.py", line 241, in render
                body = resrc.render(self)
            --- <exception caught here> ---
              File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/webservice.py", line 17, in render
                return JsonResource.render(self, txrequest)
              File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/utils.py", line 19, in render
                r = resource.Resource.render(self, txrequest)
              File "/home/ec2-user/scrapyENV/local/lib64/python2.7/site-packages/twisted/web/resource.py", line 250, in render
                return m(request)
              File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/webservice.py", line 68, in render_POST
                spiders = get_spider_list(project)
              File "/home/ec2-user/scrapyENV/local/lib/python2.7/site-packages/scrapyd/utils.py", line 116, in get_spider_list
                raise RuntimeError(msg.splitlines()[-1])
            exceptions.RuntimeError: sqlite3.OperationalError: unable to open database file
    

I am guessing that after about 300 projects scrapyd is running out of space, which is why popen fails, but the box appears to have free space. Any pointers will be helpful.
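
For reference, checking both free blocks and free inodes looks like this; I include df -i because inode exhaustion can also produce "unable to open database file" even when df -h shows plenty of space:

    # Free blocks and free inodes on the volume holding the dbs directory.
    df -h /home/ec2-user
    df -i /home/ec2-user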

I am running scrapyd 1.3.3 with the default config and Python 2.7 on the EC2 instance.
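
As I understand the defaults (paraphrased from the scrapyd docs rather than copied from the 1.3.3 source, so treat this as an assumption), the working directories are relative to wherever scrapyd is started, which is why everything lands under scrapyENV/bin in my case:

    [scrapyd]
    eggs_dir  = eggs
    logs_dir  = logs
    dbs_dir   = dbs
    http_port = 6800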

Doing an lsof on the dbs folder shows two entries for each .db file. Is this expected?

    
    scrapyd    6363 ec2-user 1005u      REG              202,1      2048   148444 /home/ec2-user/scrapyENV/bin/dbs/LatamPtBlogGenesysCom.db
    scrapyd    6363 ec2-user 1006u      REG              202,1      2048   148444 /home/ec2-user/scrapyENV/bin/dbs/LatamPtBlogGenesysCom.db
    scrapyd    6363 ec2-user 1007u      REG              202,1      2048   148503 /home/ec2-user/scrapyENV/bin/dbs/WwwPeeblesshirenewsCom.db
    scrapyd    6363 ec2-user 1009u      REG              202,1      2048   148503 /home/ec2-user/scrapyENV/bin/dbs/WwwPeeblesshirenewsCom.db
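
If two descriptors per .db file is normal, then at 500 projects the process would hold well over 1000 open files, so I wonder whether I am hitting the per-process file descriptor limit (the soft limit often defaults to 1024). This is how I would check, using the scrapyd PID 6363 from the output above:

    # Total descriptors currently held by the scrapyd process.
    lsof -p 6363 | wc -l
    # Per-process limit for that PID (Linux/EC2 only).
    grep 'open files' /proc/6363/limits
    # Soft limit inherited by processes started from this shell.
    ulimit -n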
    
    