
My crawler downloads a response body from a URL, which I save to a local file. Now I would like to upload the output to my AWS S3 bucket. I read the documentation but face two issues:

  1. The config and credentials files are apparently not of a dict type? My files are unmodified aws-config and aws-credentials files, yet the log reports:

     The s3 config key is not a dictionary type, ignoring its value of: None

  2. The response body is of type 'bytes' and cannot be processed by the feed exporter as such. I tried response.text and got the same error raised, only with 'str' instead.

Any help is highly appreciated. Thank you.

Additional information:

config file (path ~/.aws/config):

[default]
Region=eu-west-2
output=csv

and

credentials file (path ~/.aws/credentials):

[default]
aws_access_key_id=foo
aws_secret_access_key=bar

The link to the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/settings.html?highlight=s3
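For completeness, this is roughly the relevant part of my settings.py (a sketch reconstructed from the overridden-settings line in the crawl log below; the two key values are placeholders, just like in the credentials file above):

# settings.py (excerpt) -- sketch; feed values taken from the crawl log below,
# the key values are placeholders like in the credentials file above
BOT_NAME = 'aircraftPositions'

FEED_FORMAT = 'json'
FEED_URI = 's3://flightlists/lists_v1/%(name)s/%(time)s.json'
FEED_STORE_EMPTY = True

# Scrapy's S3FeedStorage reads these two settings; for now they are set
# explicitly rather than relying only on ~/.aws/credentials
AWS_ACCESS_KEY_ID = 'foo'
AWS_SECRET_ACCESS_KEY = 'bar'

And this is the relevant part of the crawl log: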

MacBook-Pro:aircraftPositions frederic$ scrapy crawl aircraftData_2018
2019-03-13 15:35:04 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: aircraftPositions)
2019-03-13 15:35:04 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-13 15:35:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'aircraftPositions', 'CONCURRENT_REQUESTS': 32, 'CONCURRENT_REQUESTS_PER_DOMAIN': 32, 'DOWNLOAD_DELAY': 10, 'FEED_FORMAT': 'json', 'FEED_STORE_EMPTY': True, 'FEED_URI': 's3://flightlists/lists_v1/%(name)s/%(time)s.json', 'NEWSPIDER_MODULE': 'aircraftPositions.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['aircraftPositions.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
2019-03-13 15:35:04 [scrapy.extensions.telnet] INFO: Telnet Password: 2f2c11f3300481ed
2019-03-13 15:35:04 [py.warnings] WARNING: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/utils/misc.py:144: ScrapyDeprecationWarning: Initialising `scrapy.extensions.feedexport.S3FeedStorage` without AWS keys is deprecated. Please supply credentials or use the `from_crawler()` constructor.
  return objcls(*args, **kwargs)

2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-call.apigateway to before-call.api-gateway
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/endpoints.json
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event choose-service-name: calling handler <function handle_service_name_alias at 0x103ca4b70>
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/s3/2006-03-01/service-2.json
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x103c688c8>
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x103c686a8>
2019-03-13 15:35:04 [botocore.args] DEBUG: The s3 config key is not a dictionary type, ignoring its value of: None
2019-03-13 15:35:04 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60)
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/_retry.json
2019-03-13 15:35:04 [botocore.client] DEBUG: Registering retry handlers for service: s3
2019-03-13 15:35:04 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback.
2019-03-13 15:35:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-03-13 15:35:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

... after that it lists the remaining enabled downloader and spider middlewares.

This is the spider:

import scrapy

# adjust the import path to your project's items module
from aircraftPositions.items import AircraftpositionsItem


class QuotesSpider(scrapy.Spider):
    name = "aircraftData_2018"
    # allowed_domains belongs at class level and holds bare domain names
    allowed_domains = ["domaine.net"]

    def url_values(self):
        # one timestamp per minute, counting backwards over the window
        time = list(range(1538140980, 1538140780, -60))
        return time

    def start_requests(self):
        list_urls = []
        for n in self.url_values():
            list_urls.append("https://domaine.net/.../.../all/{}".format(n))

        for url in list_urls:
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # instantiate the item, fill its fields, and yield it
        i = AircraftpositionsItem()
        i['url'] = response.url
        i['body'] = response.body
        yield i
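
The item class itself is just a plain two-field Scrapy item along these lines (a sketch; the import in the spider above assumes it lives in aircraftPositions/items.py):

# aircraftPositions/items.py -- minimal sketch; the two field names match
# what parse() fills in
import scrapy


class AircraftpositionsItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()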

This is pipelines.py:

class AircraftpositionsPipeline(object):

    def process_item(self, item, spider):
        return item

    # note: nothing in Scrapy calls this method -- it takes a response rather
    # than an item, so as written it never runs as part of the pipeline
    def return_body(self, response):
        page = response.url.split("/")[-1]
        filename = 'aircraftList-{}.csv'.format(page)
        with open(filename, 'wb') as f:
            f.write(response.body)
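
This is a minimal sketch of the improved pipeline I am trying to write instead: one that actually writes each item's body to a local file from process_item (the class name and filename pattern are only illustrative):

# pipelines.py -- sketch; writes each crawled body to its own local file
class AircraftpositionsWriteFilePipeline(object):

    def process_item(self, item, spider):
        page = item['url'].split("/")[-1]
        filename = 'aircraftList-{}.csv'.format(page)
        # item['body'] is bytes (response.body), hence binary mode
        with open(filename, 'wb') as f:
            f.write(item['body'])
        return item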
Freddy
  • It would help if you could provide the following: the link to the tutorial you're reading, the contents of the credentials file you're using (cover the actual values, I don't need your credentials, I just need to see the file format), and the piece of your Python code where you use those credentials. – Maksim Kviatkouski Mar 12 '19 at 17:12
  • @MaksimKviatkouski; thanks, find the info requested inserted in the question. – Freddy Mar 13 '19 at 08:20
  • OK, can you add the error stacktrace? One error line is not quite enough. Please paste the whole stack here (it should be multiple lines mentioning filenames and line numbers). – Maksim Kviatkouski Mar 13 '19 at 16:45
  • @MaksimKviatkouski, thanks for taking the time. here we go with all the pertinent part of the DEBUG lines. – Freddy Mar 13 '19 at 18:50
  • I don't see an error you've mentioned in the logs you've provided. I'm interested in that part with several more lines around it for context. – Maksim Kviatkouski Mar 13 '19 at 19:00
  • @MaksimKviatkouski if you start from the bottom and count upwards over the lines starting with a date and time, it is line number 6. – Freddy Mar 14 '19 at 12:26
  • hm... I see, doesn't seem to be a critical error. Did you specify AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY in your settings.py? Can you share your code where you save response into S3? – Maksim Kviatkouski Mar 14 '19 at 18:29
  • Yes, I did specify the aws access keys. Also, to avoid any problem for the time being the access keys are explicit in the settings.py script. I added above the spider and the pipeline for you to check on the Response. I appreciate you taking so much time for this. – Freddy Mar 14 '19 at 20:53
  • No problem. What are you trying to put into S3? A csv file? Are you expecting that response from that website will be in csv format? Your pipeline seems a bit odd since I don't see what would call your method return_body – Maksim Kviatkouski Mar 14 '19 at 21:51
  • Also, can you share with me exact command you use to launch your spider? – Maksim Kviatkouski Mar 14 '19 at 22:03
  • Yes, I would like to upload CSV files to S3. It is possible that the pipeline is odd and I am trying to write an improved version (one that actually works). Do you think that the pipeline has an influence on the config and credentials files? – Freddy Mar 15 '19 at 08:55
  • I launch the spider using command line. When it will be all working I will deploy it on spiderHub to get multiple crawlers running. – Freddy Mar 15 '19 at 08:56
  • Sorry, but I still need some information to help you: what is the exact command you use to launch your spider? And does the website respond with CSV files, or does it have regular HTML pages that you parse, expecting to generate CSV files which then need to be uploaded to S3? – Maksim Kviatkouski Mar 15 '19 at 16:05
  • Basically I crawl a URL and use the response.body; this can be seen above in the spider. Then I open a file and write the body into it; this can be seen above under the pipeline. The command to launch the spider is "Scrapy crawl my spider". – Freddy Mar 18 '19 at 14:35

0 Answers