
I have a working spider project that extracts URL content (no CSS). I crawled several sets of data and stored them in a series of .csv files. Now I am trying to set it up to run on Scrapinghub for a long scraping run. So far I am able to upload the spider and run it on Scrapinghub. My problem is that the results appear in the 'log' and not under 'items'. The amount of data exceeds the log capacity and so gives me an error. How can I set up my pipelines/extractor to work and return a JSON or CSV file? I would also be happy with a solution that sends the scraped data to a database, as I failed to achieve that too. Any guidance is appreciated.

The spider:

import scrapy

from Positions.items import positionsItem


class DataSpider(scrapy.Spider):
    name = "Data_2018"

    def url_values(self):
        time = list(range(1538140980, 1538140820, -60))
        return time

    def start_requests(self):
        allowed_domains = ["https://website.net"]
        list_urls = []
        for n in self.url_values():
            list_urls.append("https://website.net/.../.../.../all/{}".format(n))

        for url in list_urls:
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        data = response.body
        items = positionsItem()
        items['file'] = data
        yield items

The pipeline:

class positionsPipeline(object):

    def process_item(self, item, spider):
        return item
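
The question also asks about sending the scraped data to a database. A minimal sketch of a pipeline doing that with Python's built-in sqlite3 is shown below; the database file, table and column names are illustrative assumptions, not part of the original project, and the class would still need to be registered in ITEM_PIPELINES:

# Hypothetical pipeline sketch: persist each scraped item to a local SQLite file.
# The database file, table and column names are illustrative assumptions.
import sqlite3


class SQLitePositionsPipeline(object):

    def open_spider(self, spider):
        # Open (or create) the database and make sure the table exists.
        self.conn = sqlite3.connect('positions.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS positions (file BLOB)')

    def close_spider(self, spider):
        # Commit once at the end and release the connection.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Store the raw payload of each item as one row.
        self.conn.execute('INSERT INTO positions (file) VALUES (?)', (item['file'],))
        return item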

The settings:

BOT_NAME = 'Positions'
SPIDER_MODULES = ['Positions.spiders']
NEWSPIDER_MODULE = 'Positions.spiders'
USER_AGENT = get_random_agent()
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 10
SPIDER_MIDDLEWARES = {
    'Positions.middlewares.positionsSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'Positions.middlewares.positionsDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'Positions.pipelines.positionsPipeline': 300,
}
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The item:

import scrapy


class positionsItem(scrapy.Item):
    file = scrapy.Field()

Scrapinghub log shows:

13: 2019-02-28 07:46:13 ERROR   Rejected message because it was too big: ITM {"_type":"AircraftpositionsItem","file":"{\"success\":true,\"payload\":{\"aircraft\":{\"0\":{\"000001\":[null,null,\"CFFAW\",9.95729,-84.1405,9500,90,136,1538140969,null,null,\"2000\",\"2-39710687\",[9.93233,-84.1386,277]],\"000023\":[\"ULAC\",null,\"PH4P4\",
  • In the ScrapingHub log, what does it show in the line (~5) containing `Overridden settings: `? Does it show `'LOG_ENABLED': False, 'LOG_LEVEL': 'INFO'`? – malberts Feb 28 '19 at 09:56
  • Can you show your scrapy.cfg? – Rafael Almeida Feb 28 '19 at 10:09
  • thanks @Rafael: -here the cfg content:[settings] default = Positions.settings [deploy] #url = http://localhost:6800/ project = Positions – Freddy Feb 28 '19 at 10:25
  • thanks @malberts, it shows 'LOG_ENABLED': False, 'MEMUSAGE_LIMIT_MB': 950, – Freddy Feb 28 '19 at 10:31
  • @Freddy Actually your other comment with the log output was useful too. Put that into your question. How big exactly is one of those responses you put into `file`? What happens if you change your URLs to download only 1 item? – malberts Feb 28 '19 at 10:33
  • @malberts: in addition if that help I found this a few lines below: '[scrapy.middleware] Enabled item pipelines: []' – Freddy Feb 28 '19 at 10:37
  • @Freddy Based on your log output, you are hitting an item size limit. See here: https://support.scrapinghub.com/support/discussions/topics/22000009523 and this [FAQ entry](https://support.scrapinghub.com/support/solutions/articles/22000218173-why-do-i-get-rejected-message-because-it-was-too-big-error-). You need to store only the required fields from that response, not the whole response (which is more than 1MB in size); a sketch of this appears after these comments. If you actually need all of that, refer to the suggestions in the first link. – malberts Feb 28 '19 at 10:38
  • @malberts thanks for the answer, I'll check the link right now. If I try to download only one item, the same thing happens. And yes, the size is about 950B to above 1MB. – Freddy Feb 28 '19 at 10:51
  • @Freddy I haven't done that myself, but that sounds about right. – malberts Feb 28 '19 at 10:57
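
As suggested in the comments above, a minimal sketch of a `parse` method that keeps only selected fields from the JSON payload instead of the whole response body might look like this; the nested keys (`payload`, `aircraft`, `"0"`) mirror the log excerpt above, and everything else is an assumption for illustration:

import json


def parse(self, response):
    # Decode the JSON payload instead of storing the raw response body.
    payload = json.loads(response.text)

    # Keep only what is needed; yielding one small item per aircraft record
    # stays well under the per-item size limit mentioned in the comments.
    # The nested keys below follow the structure visible in the log excerpt.
    aircraft = payload.get('payload', {}).get('aircraft', {}).get('0', {})
    for aircraft_id, record in aircraft.items():
        yield {
            'aircraft_id': aircraft_id,
            'record': record,
        }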

1 Answer


From your settings file it looks like there isn't a predefined feed output mechanism for Scrapy to use. It's odd that it worked the first time locally (producing a .csv file).

In any case, here are the extra lines you need to add to settings.py for the feed export to work. If you just want to write the output locally to a .csv file:

# Local .csv version
FEED_URI = 'file://NAME_OF_FILE_PATH.csv'
FEED_FORMAT = 'csv'

I also use this version for uploading a JSON file to an S3 bucket:

# Remote S3 .json version
AWS_ACCESS_KEY_ID = 'YOUR_AWS_ACCESS_KEY_ID'
AWS_SECRET_ACCESS_KEY = 'YOUR_AWS_SECRET_ACCESS_KEY'

FEED_URI = 's3://BUCKET_NAME/NAME_OF_FILE_PATH.json'
FEED_FORMAT = 'json'
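
As a side note, newer Scrapy releases (2.1 and later) consolidate these two settings into the single FEEDS dictionary; a sketch of the equivalent configuration, using the same placeholder path:

# Equivalent feed configuration on Scrapy 2.1+ (placeholder bucket/path as above)
FEEDS = {
    's3://BUCKET_NAME/NAME_OF_FILE_PATH.json': {'format': 'json'},
}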