I am trying to run a PySpark script inside a Docker container that does transformations and writes the results to AWS.
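For context, I build and run the image with commands along these lines (the tag name here is just a placeholder):
docker build -t pyspark-aws .
docker run pyspark-aws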
When I do docker run, the console displays:
:: loading settings :: url = jar:file:/usr/local/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b351103a-0cb0-46f6-9894-49096e91d74a;1.0
confs: [default]
found org.apache.hadoop#hadoop-aws;3.3.4 in central
found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar ...
[SUCCESSFUL ] org.apache.hadoop#hadoop-aws;3.3.4!hadoop-aws.jar (104510ms)
downloading https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar ...
downloading https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar ...
[SUCCESSFUL ] org.wildfly.openssl#wildfly-openssl;1.0.7.Final!wildfly-openssl.jar (27243ms)
However, after a few seconds the following error arises:
:: problems summary ::
:::: WARNINGS
[FAILED ] com.amazonaws#aws-java-sdk-bundle;1.12.262!aws-java-sdk-bundle.jar: Downloaded file size (9846784) doesn't match expected Content Length (280645251) for https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar. Please retry. (1218943ms)
==== central: tried
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: FAILED DOWNLOADS ::
:: ^ see resolution messages for details ^ ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.amazonaws#aws-java-sdk-bundle;1.12.262!aws-java-sdk-bundle.jar
::::::::::::::::::::::::::::::::::::::::::::::
I then get this error, presumably because the JVM never started after the failed download:
RuntimeError: Java gateway process exited before sending its port number
In PySpark I use the .write.parquet method with an s3a path to write to AWS. My Dockerfile looks like this:
ARG IMAGE_VARIANT=slim-buster
ARG OPENJDK_VERSION=8
ARG PYTHON_VERSION=3.9.8
FROM python:${PYTHON_VERSION}-${IMAGE_VARIANT} AS py3
FROM openjdk:${OPENJDK_VERSION}-${IMAGE_VARIANT}
COPY --from=py3 / /
ARG PYSPARK_VERSION=3.2.0
# datetime and uuid are part of the standard library and need no pip install
RUN pip --no-cache-dir install pyspark==${PYSPARK_VERSION} pandas boto3 botocore
COPY housing_prices_2022_10.csv /app/housing_prices_2022_10.csv
COPY script.py /
CMD [ "python", "./script.py" ]
The Python script looks like this:
import os
from pyspark.sql import SparkSession

# Credentials are redacted here; they are set before the JVM starts
os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""

# hadoop-aws pulls in aws-java-sdk-bundle and wildfly-openssl transitively
spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .getOrCreate()

housing_prices = spark.read.option("header", "true").csv(complete_raw_path)
housing_prices = housing_prices.distinct()  # distinct() returns a new DataFrame, so assign the result
housing_prices.write.parquet(curated_path, mode="overwrite")
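For reference, the path variables used above are defined earlier in the script and look roughly like this (both values are placeholders; the real bucket name and file locations are redacted):
complete_raw_path = "/app/housing_prices_2022_10.csv"  # placeholder: the CSV copied into the image
curated_path = "s3a://<my-bucket>/curated/housing_prices/"  # placeholder: s3a output location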