
I'm using scrapy-splash to screenshot a web page and output a PNG along with some meta-data. I know that Scrapy logs every action the engine executes, with timestamps and so on, but I'm having trouble figuring out how to access that information in my spider and pass it into an item. Any advice or tips would be most appreciated.

Desired meta-data: 1) target site IP; 2) timestamp (UTC) at page load; 3) timestamp (UTC) at page capture

import json
import base64
import scrapy
from scrapy_splash import SplashRequest
from project_spider.screenshot_format import PDF

class screenshot(scrapy.Spider):

    name = 'screenshot'

    def start_requests(self):
        url = 'http://www.gxjjw.gov.cn/staticpages/20171109/gxjjw5a03a8bc-128325.shtml'

        splash_args = {
            'wait': 3.0,
            'html': 1,
            'png': 1,
            'width': 600,
            'render_all': 1,
        }

        yield SplashRequest(url, self.parse_result, endpoint='render.json',
                            args=splash_args)

    def parse_result(self, response):

        # Splash returns the screenshot as a base64 string; prepend a data-URI header
        png_b64 = response.data['png']
        header = 'data:image/png;base64,'
        png_b64 = header + png_b64

        # Meta-data variables will go here
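
To make the three desired fields concrete, here is a rough sketch of the kind of helper I have in mind. Everything in it is an assumption on my part: the function names (`utc_timestamp`, `resolve_ip`, `build_metadata`) and the dict keys are made up for illustration, it uses only the standard library, and it does not read anything from Scrapy's or Splash's own logs. The IP comes from a DNS lookup on the spider's machine, which may differ from the address Splash actually connected to, and a timestamp taken in the callback only approximates when the response arrived, not the moment of rendering inside Splash.

```python
import socket
from datetime import datetime, timezone
from urllib.parse import urlparse


def utc_timestamp():
    """Return the current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


def resolve_ip(url, resolver=socket.gethostbyname):
    """Resolve the host of `url` to an IPv4 address via a DNS lookup.

    `resolver` is injectable (a hypothetical convenience for testing).
    Returns None if the URL has no host or the lookup fails.
    """
    host = urlparse(url).hostname
    if host is None:
        return None
    try:
        return resolver(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def build_metadata(url, loaded_at, captured_at, resolver=socket.gethostbyname):
    """Bundle the three desired fields into a plain dict for an item."""
    return {
        'url': url,
        'target_ip': resolve_ip(url, resolver),
        'loaded_at_utc': loaded_at,
        'captured_at_utc': captured_at,
    }
```

In the spider, something like `build_metadata(page_url, loaded_at, utc_timestamp())` could be called from `parse_result`, with `loaded_at` recorded at whatever point is treated as "page load"; where exactly to take that reading is the part I am unsure about.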