
I am trying to run a PySpark script inside a Docker container that does some transformations and writes the result to AWS (S3).

When I start the container with docker run, the console shows:

:: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b351103a-0cb0-46f6-9894-49096e91d74a;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.3.4 in central
    found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
    found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar ...
    [SUCCESSFUL ] org.apache.hadoop#hadoop-aws;3.3.4!hadoop-aws.jar (104510ms)
downloading https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar ...
downloading https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar ...
    [SUCCESSFUL ] org.wildfly.openssl#wildfly-openssl;1.0.7.Final!wildfly-openssl.jar (27243ms)

However, after a while the following error arises:

:: problems summary ::
:::: WARNINGS
[FAILED     ] com.amazonaws#aws-java-sdk-bundle;1.12.262!aws-java-sdk-bundle.jar: Downloaded file size (9846784) doesn't match expected Content Length (280645251) for https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar. Please retry. (1218943ms)


    ==== central: tried

      https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar

            ::::::::::::::::::::::::::::::::::::::::::::::

            ::              FAILED DOWNLOADS            ::

            :: ^ see resolution messages for details  ^ ::

            ::::::::::::::::::::::::::::::::::::::::::::::

            :: com.amazonaws#aws-java-sdk-bundle;1.12.262!aws-java-sdk-bundle.jar

            ::::::::::::::::::::::::::::::::::::::::::::::

Because the SparkSession apparently cannot start without the missing jar, I then get the error:

RuntimeError: Java gateway process exited before sending its port number

In PySpark I use the .write.parquet method with an s3a:// path to write to AWS. My Dockerfile looks like this:

ARG IMAGE_VARIANT=slim-buster
ARG OPENJDK_VERSION=8
ARG PYTHON_VERSION=3.9.8

FROM python:${PYTHON_VERSION}-${IMAGE_VARIANT} AS py3
FROM openjdk:${OPENJDK_VERSION}-${IMAGE_VARIANT}

COPY --from=py3 / /

ARG PYSPARK_VERSION=3.2.0
RUN pip --no-cache-dir install pyspark==${PYSPARK_VERSION} pandas datetime boto3 botocore uuid

COPY /housing_prices_2022_10.csv /app/housing_prices_2022_10.csv
ADD script.py /

CMD [ "python", "./script.py" ]
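
One workaround I am considering (not tested yet) is baking the two jars into the image at build time, so that spark.jars.packages / Ivy does not have to download them every time the container starts. A rough sketch, assuming PySpark keeps its jars under the site-packages path shown in the ":: loading settings ::" line of the log; the jar URLs are the ones Ivy prints above:

ARG IMAGE_VARIANT=slim-buster
ARG OPENJDK_VERSION=8
ARG PYTHON_VERSION=3.9.8

FROM python:${PYTHON_VERSION}-${IMAGE_VARIANT} AS py3
FROM openjdk:${OPENJDK_VERSION}-${IMAGE_VARIANT}

COPY --from=py3 / /

ARG PYSPARK_VERSION=3.2.0
RUN pip --no-cache-dir install pyspark==${PYSPARK_VERSION} pandas datetime boto3 botocore uuid

# Bake the AWS jars into the image at build time so no Ivy download is needed
# when the container runs. The target directory is the pyspark jars folder
# printed in the ":: loading settings ::" line of the log.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /usr/local/lib/python3.9/site-packages/pyspark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar /usr/local/lib/python3.9/site-packages/pyspark/jars/

COPY /housing_prices_2022_10.csv /app/housing_prices_2022_10.csv
ADD script.py /

CMD [ "python", "./script.py" ]

If that worked, the spark.jars.packages config in the script below would presumably have to be dropped, since it would still trigger the Ivy resolution.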

The Python script looks like this:

import os
from pyspark.sql import SparkSession

os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""


spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .getOrCreate()

housing_prices = spark.read.option("header", "true").csv(complete_raw_path)

housing_prices = housing_prices.distinct()

housing_prices.write.parquet(curated_path, mode="overwrite")
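
For completeness, the s3a write I am describing looks roughly like the sketch below. The bucket name is a placeholder, and the fs.s3a.* options are just one way of handing the credentials to the S3A connector, not necessarily what my real script does:

import os
from pyspark.sql import SparkSession

# Placeholder S3 location; the real curated_path is defined elsewhere.
curated_path = "s3a://my-bucket/curated/housing_prices_2022_10"

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    # Pass the credentials to the S3A connector explicitly instead of relying
    # only on the AWS_* environment variables.
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

# Read the CSV copied into the image by the Dockerfile and write it out as Parquet.
housing_prices = spark.read.option("header", "true").csv("/app/housing_prices_2022_10.csv")
housing_prices.write.parquet(curated_path, mode="overwrite")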
paul773
  • Can you please share what your script.py looks like? – SRJ Oct 14 '22 at 07:37
  • How are you constructing your Spark context/instance? – SRJ Oct 14 '22 at 07:44
  • I have just added it at the end of the question – paul773 Oct 14 '22 at 12:40
  • This might help you possibly https://stackoverflow.com/questions/47349376/gcp-dataproc-spark-jar-packages-issue-downloading-dependencies – SRJ Oct 14 '22 at 17:43
  • I tried with your Dockerfile and it worked fine on my machine. It might be some network error for you. Retry a few times or check if you're behind a firewall. – SRJ Oct 15 '22 at 11:15

0 Answers