My crawler downloads a Request body from a URL, which I save to a file locally. Now I would like to write the results to my AWS S3 bucket. I read the documentation but face two issues:
1. The config and credentials files are apparently not of dict type? They are unmodified aws-config and aws-credentials files, yet I get:
The s3 config key is not a dictionary type, ignoring its value of: None
2. The response body is of type 'bytes' and cannot be processed by the feed exporter as such. I tried response.text instead and the same error was raised, but with 'str'.
Any help is highly appreciated. Thank you.
Additional information:
config file (path ~/.aws/config):
[default]
Region=eu-west-2
output=csv
and
credentials file (path ~/.aws/credentials):
[default]
aws_access_key_id=aws_access_key_id=foo
aws_secret_access_key=aws_secret_access_key=bar
The link to the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/settings.html?highlight=s3
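From that settings page, my understanding is that the AWS credentials can also be supplied directly in settings.py instead of relying on ~/.aws. A minimal sketch of what I believe that would look like (key values are placeholders; the bucket path is the one already in my FEED_URI):

AWS_ACCESS_KEY_ID = 'foo'
AWS_SECRET_ACCESS_KEY = 'bar'

FEED_URI = 's3://flightlists/lists_v1/%(name)s/%(time)s.json'
FEED_FORMAT = 'json'
FEED_STORE_EMPTY = True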
Here is the log from running the spider:
MacBook-Pro:aircraftPositions frederic$ scrapy crawl aircraftData_2018
2019-03-13 15:35:04 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: aircraftPositions)
2019-03-13 15:35:04 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-13 15:35:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'aircraftPositions', 'CONCURRENT_REQUESTS': 32, 'CONCURRENT_REQUESTS_PER_DOMAIN': 32, 'DOWNLOAD_DELAY': 10, 'FEED_FORMAT': 'json', 'FEED_STORE_EMPTY': True, 'FEED_URI': 's3://flightlists/lists_v1/%(name)s/%(time)s.json', 'NEWSPIDER_MODULE': 'aircraftPositions.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['aircraftPositions.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
2019-03-13 15:35:04 [scrapy.extensions.telnet] INFO: Telnet Password: 2f2c11f3300481ed
2019-03-13 15:35:04 [py.warnings] WARNING: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/utils/misc.py:144: ScrapyDeprecationWarning: Initialising `scrapy.extensions.feedexport.S3FeedStorage` without AWS keys is deprecated. Please supply credentials or use the `from_crawler()` constructor.
return objcls(*args, **kwargs)
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-call.apigateway to before-call.api-gateway
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/endpoints.json
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event choose-service-name: calling handler <function handle_service_name_alias at 0x103ca4b70>
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/s3/2006-03-01/service-2.json
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x103c688c8>
2019-03-13 15:35:04 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x103c686a8>
2019-03-13 15:35:04 [botocore.args] DEBUG: The s3 config key is not a dictionary type, ignoring its value of: None
2019-03-13 15:35:04 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60)
2019-03-13 15:35:04 [botocore.loaders] DEBUG: Loading JSON file: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/botocore/data/_retry.json
2019-03-13 15:35:04 [botocore.client] DEBUG: Registering retry handlers for service: s3
2019-03-13 15:35:04 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback.
2019-03-13 15:35:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-03-13 15:35:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
... (after that it enables the rest of the downloader and spider middlewares).
This is the spider:
import scrapy

# assumption: the item class is defined in aircraftPositions/items.py
from aircraftPositions.items import AircraftpositionsItem


class QuotesSpider(scrapy.Spider):
    name = "aircraftData_2018"

    def url_values(self):
        # one Unix timestamp per minute, going backwards
        time = list(range(1538140980, 1538140780, -60))
        return time

    def start_requests(self):
        allowed_domains = ["https://domaine.net"]
        list_urls = []
        for n in self.url_values():
            list_urls.append("https://domaine.net/.../.../all/{}".format(n))
        for url in list_urls:
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        i = AircraftpositionsItem()
        i['url'] = response.url
        i['body'] = response.body  # raw bytes from the download
        yield i
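For reference, the item class only declares the two fields used in parse() (a minimal sketch; the field names are the ones assigned above):

import scrapy

class AircraftpositionsItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()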
This is pipelines.py:
class AircraftpositionsPipeline(object):

    def process_item(self, item, spider):
        return item

    def return_body(self, response):
        page = response.url.split("/")[-1]
        filename = 'aircraftList-{}.csv'.format(page)
        with open(filename, 'wb') as f:
            f.write(response.body)
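If the feed exporter simply cannot handle a bytes body, would uploading straight from a pipeline with boto3 be the better route? A rough sketch of what I have in mind (bucket name and key pattern copied from my FEED_URI; the class name is just a placeholder):

import boto3

class S3UploadPipeline(object):

    def open_spider(self, spider):
        # boto3 reads ~/.aws/credentials and ~/.aws/config on its own
        self.s3 = boto3.client('s3')

    def process_item(self, item, spider):
        # name each object after the last segment of the source URL
        key = 'lists_v1/{}/{}.json'.format(spider.name, item['url'].split('/')[-1])
        self.s3.put_object(Bucket='flightlists', Key=key, Body=item['body'])
        return item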