
I am trying to import an additional Python library, datacompy, into an AWS Glue job that uses Glue version 2.0, with the steps below:

  1. Open the AWS Glue console.

  2. Under Job parameters, I added the following:

  3. For Key, I added `--additional-python-modules`. For Value, I added `datacompy==0.7.3, s3://python-modules/datacompy-0.7.3.whl` (a boto3 equivalent is sketched after this list).
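
For reference, the same job parameter can also be set programmatically; a minimal boto3 sketch, assuming a placeholder job name of `my-glue-job`:

import boto3

glue = boto3.client("glue")

# Read the current job definition ("my-glue-job" is a placeholder name)
job = glue.get_job(JobName="my-glue-job")["Job"]

# Add the extra module(s) to the job's default arguments
# (same value as in step 3 above)
default_args = job.get("DefaultArguments", {})
default_args["--additional-python-modules"] = (
    "datacompy==0.7.3, s3://python-modules/datacompy-0.7.3.whl"
)

# Write the updated definition back; update_job replaces the job definition,
# so carry over Role and Command from the existing job
glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": default_args,
    },
)

My job script (the initialization part) looks like this: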

import sys  # needed for sys.argv in getResolvedOptions below

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

import datacompy

from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

## @params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA','additional-python-modules'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


But the job returns the error:

ModuleNotFoundError: No module named 'datacompy'

How do I resolve this issue?

cloud_hari
  • Looks like you are installing datacompy twice. You should be able to get by with just `datacompy==0.7.3` and skip the whl file. However, if datacompy is a C-based lib you might need to go down the whl route. Also, do the logs indicate an install failure of any kind? – Bob Haffner Feb 21 '22 at 01:49
  • Hello @BobHaffner, I tried your step, but it is failing. I didn't find specific logs for the failure; the job fails right away when it hits `import datacompy`. I have added the job initialization part to the question. One question: do we need to add any logic in the code to read the lib from args in order to install it? – cloud_hari Feb 21 '22 at 05:18
  • So using Spark 2.4, Python 3 (Glue Version 2.0), the following allows me to import datacompy: for Key, `--additional-python-modules`; for Value, `datacompy==0.7.3`. No, you don't have to do anything in your code to install it – Bob Haffner Feb 21 '22 at 15:21
  • I am using the same version and configuration, but it fails with the same import error – cloud_hari Feb 22 '22 at 16:35
  • That's crazy. I wonder if the Job is caching the old config. Maybe try to create a whole new Job? – Bob Haffner Feb 22 '22 at 19:01
  • I'm running into the same problem and I agree, that's crazy. @BobHaffner could you share your whole job run configuration (a screenshot from the Glue 'Edit job' window, for example, of all settings)? Would be greatly appreciated. I tried running a new Glue job with multiple changes to the config and nothing helped – P D Mar 08 '22 at 16:04
  • @BobHaffner @cloud_hari what type of Glue jobs are you using? Is it `Python shell`? I started to worry that this option might not work on `Python shell` job run types, only on Spark ones – P D Mar 08 '22 at 16:10
  • Hi @PD I'll post some screenshots in a bit. We're both using Spark 2.4, Python 3 (Glue Version 2.0). However, I've installed libs with python shell jobs before. I recall the steps being slightly different. Have you tried putting `datacompy==0.7.3` in the Python Library Path box instead of --additional-python-modules? – Bob Haffner Mar 08 '22 at 17:37
  • @PD I submitted an answer with screenshots – Bob Haffner Mar 09 '22 at 04:04
  • Thank you @BobHaffner, I really appreciate your help. However it still does not work for Python shell. I tried putting it in `Python Library Path` as you suggested but this box expects a path to S3 and results in `ParamValidationError` when I try to run the job. Do you have maybe any other ideas? – P D Mar 09 '22 at 09:02
  • @PD I had to go the whl file route to get it to work in python shell. See my edited answer – Bob Haffner Mar 09 '22 at 14:28
  • @cloud_hari Did you set up Glue in a private VPC w/o internet access by chance? see this for more details https://medium.com/@jasonli.lijie/aws-glue-run-python-shell-job-with-external-libraries-in-private-vpc-459b9849c235 – Bob Haffner Mar 09 '22 at 14:31

1 Answer

With Spark 2.4, Python 3 (Glue Version 2.0), I set the following Job Parameter:

Key:   --additional-python-modules
Value: datacompy==0.7.3

Then I can import it in my Job like so:

import pandas as pd
import numpy as np
import datacompy

# Build two small random DataFrames to compare
df1 = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
df2 = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# Compare the two frames, joining on column 'a'
compare = datacompy.Compare(df1, df2, join_columns='a')

# Print a human-readable comparison report
print(compare.report())

And when I check the CloudWatch Log for the Job Run:

(screenshot of the CloudWatch log output for the job run)

If you're using a Python Shell Job, try the following (a rough sketch of these steps follows below):

  1. Create a datacompy whl file, or download it from PyPI.

  2. Upload that file to an S3 bucket.

  3. Enter the path to the S3 whl file in the Python library path box, e.g.

     s3://my-bucket/datacompy-0.8.0-py3-none-any.whl
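
A rough sketch of steps 1 and 2, assuming the wheel is built locally with pip and uploaded with boto3 (the bucket name is a placeholder):

import subprocess
import boto3

# Step 1: build a wheel for datacompy locally by shelling out to pip
# (you could also download the wheel straight from PyPI instead)
subprocess.run(
    ["pip", "wheel", "datacompy==0.8.0", "--no-deps", "--wheel-dir", "."],
    check=True,
)

# Step 2: upload the wheel to an S3 bucket ("my-bucket" is a placeholder)
s3 = boto3.client("s3")
s3.upload_file(
    "datacompy-0.8.0-py3-none-any.whl",  # local file produced by pip wheel
    "my-bucket",                         # bucket name
    "datacompy-0.8.0-py3-none-any.whl",  # S3 key
)

# Step 3 is done in the Glue console: set the Python library path to
#   s3://my-bucket/datacompy-0.8.0-py3-none-any.whl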

Bob Haffner