
I am writing a Python application that depends on the Scrapy module. It works fine locally but fails when I run it from the AWS Lambda test console. My Python project has a requirements.txt file with the following dependency:

scrapy==1.6.0

I packaged all dependencies by following this link: https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html. I also put my source code (*.py) at the root level of the zip file. My packaging script can be found at https://github.com/zhaoyi0113/quote-datalake/blob/master/bin/deploy.sh.

It basically does two things: first, it runs `pip install -r requirements.txt -t dist` to download all dependencies into the `dist` directory; second, it copies the application's Python source code into the `dist` directory.
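For reference, a minimal sketch of what such a script looks like (an outline based on the description above, not the exact script in the repository; the dist path matches the Terraform config below):

#!/usr/bin/env bash
set -e

# Step 1: install all dependencies from requirements.txt into dist/
pip install -r requirements.txt -t dist

# Step 2: copy the application source files to the root of the package directory
cp *.py dist/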

The deployment is done via Terraform; below is the configuration file.

provider "aws" {
  profile    = "default"
  region     = "ap-southeast-2"
}

variable "runtime" {
  default = "python3.6"
}

data "archive_file" "zipit" {
    type        = "zip"
    source_dir  = "crawler/dist"
    output_path = "crawler/dist/deploy.zip"
}
resource "aws_lambda_function" "test_lambda" {
  filename      = "crawler/dist/deploy.zip"
  function_name = "quote-crawler"
  role          = "arn:aws:iam::773592622512:role/LambdaRole"
  handler       = "handler.handler"
  source_code_hash = "${data.archive_file.zipit.output_base64sha256}"
  runtime = "${var.runtime}"
}

Terraform zips the directory and uploads the file to Lambda.

At runtime, Lambda fails with the error `Unable to import module 'handler': cannot import name 'etree'` whenever there is an `import scrapy` statement. I don't use `etree` in my code, so I believe it is something used by Scrapy.

My source code can be found at https://github.com/zhaoyi0113/quote-datalake/tree/master/crawler. There are only two simple Python files.

They work fine when I run them locally; the error only appears in Lambda. Is there a different way to package Scrapy for Lambda?

– Joey Yi Zhao

2 Answers


Based on the communication with Tim, the issue is caused by libraries that were built locally being incompatible with the Lambda environment.

The easiest way to resolve this issue is to use the docker image lambci/lambda to build a package with the command:

$ docker run -v $(pwd):/outputs -it --rm lambci/lambda:build-python3.6 pip install scrapy -t /outputs/
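After the container finishes, the dependencies are written into the mounted directory; the application source still needs to be copied in before the directory is zipped. A sketch of the full flow, assuming the crawler/dist layout from the question:

$ cd crawler/dist
$ docker run -v $(pwd):/outputs -it --rm lambci/lambda:build-python3.6 pip install scrapy -t /outputs/
$ cp ../*.py .
# Terraform's archive_file data source then zips this directory into deploy.zip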
– Joey Yi Zhao

You need to provide the entire dependency tree; Scrapy also has a set of dependencies (and they may have dependencies of their own).

The easiest way to download all the required dependencies is to use pip:

$ pip install -t packages/ scrapy

This will download Scrapy and all of its dependencies into the packages/ folder.
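If you are zipping the package by hand rather than letting Terraform's archive_file do it, note that the contents of packages/ (not the folder itself) and your handler must sit at the root of the zip. A sketch, assuming a handler.py entry point:

$ cd packages
$ zip -r ../deploy.zip .
$ cd ..
$ zip -g deploy.zip handler.py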

Scrapy has lxml and pyOpenSSL as dependencies that include compiled components. Unless they are statically compiled, they will likely require that the C libraries they link against are also installed on the Lambda VM.

From the lxml documentation, it requires:

  • libxml2 version 2.9.2 or later.
  • libxslt version 1.1.27 or later. We recommend libxslt 1.1.28 or later.

Maybe try adding installation of these to your deploy script. You should be able to use (I'm guessing at the package names) `yum -y install libxml2 libxslt`.
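If pip ends up compiling lxml from source, the development headers are needed in addition to the runtime libraries. On Amazon Linux the package names are typically the following (an assumption; verify against your AMI):

$ sudo yum -y install gcc libxml2-devel libxslt-devel python3-devel
$ pip install -t packages/ --no-binary lxml lxml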

Another good idea is to test your scripts on an Amazon Linux EC2 instance, as this is close to the environment that Lambda executes in.

– Tim
  • Which pip version are you using? My `pip 19.1.1` doesn't support the `-d` parameter. What I am using is `pip install -r requirements.txt -t dist`, which downloads all dependencies to the `dist` directory. Does it work the same as your command? – Joey Yi Zhao Jul 19 '19 at 01:09
  • Sorry yes it has changed to `-t` – Tim Jul 19 '19 at 01:11
  • I already tried that but still the same issue happens. Did I miss any dependencies in my project? – Joey Yi Zhao Jul 19 '19 at 01:13
  • I've updated the answer with some additional info. Likely you are running into issues with C extensions. In the past I've had to compile these on an Amazon Linux EC2 instance. I was doing a lot of work with Lambda in my previous job and am recalling what I had set up. – Tim Jul 19 '19 at 01:14
  • I have added `lxml` and `pyOpenSSL` to `requirements.txt` and installed the dependencies you mentioned on `ubuntu`. The compile works fine but I still get the same error from Lambda. The OS I am using is Ubuntu `18.04.2 LTS (Bionic Beaver)`. – Joey Yi Zhao Jul 19 '19 at 01:47
  • I'd still test on an Amazon Linux EC2 instance. You might find it has compiled correctly but the .so is not loading correctly or is unable to resolve the correct library at runtime. I've had to symlink libraries into other locations before to resolve this issue. To save a lot of headaches we would get source packages and compile them on an Amazon Linux instance, ensuring they are compatible. – Tim Jul 19 '19 at 02:45
  • Thanks. Finally I am using docker to solve my issue. This is the command: `docker run -v $(pwd):/outputs -it --rm lambci/lambda:build-python3.6 pip install scrapy -t /outputs/` – Joey Yi Zhao Jul 19 '19 at 04:40
  • Yes that would do it as well. Maybe write that up as an answer. – Tim Jul 19 '19 at 04:53
  • Yep, posted the answer. – Joey Yi Zhao Jul 19 '19 at 04:58
  • What would you put in the requirements.txt file to install scrapy and its dependencies? – leeprevost Dec 11 '21 at 19:28