
I'm trying to read a Kinesis stream using Spark/Python in a Jupyter notebook provided by AWS. I took the code from the AWS documentation, but when I try to create a DataFrame from Kinesis I get a dependency error. I thought all the dependencies were in place because I created the notebook from the "SparkMagic PySpark" template. Here is my code:

import sys
from datetime import datetime
import boto3
import base64
from pyspark.sql import DataFrame, Row
from pyspark.context import SparkContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream


# Set up the Spark and Glue contexts
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

#ssc = StreamingContext(sc, 1)

# Read the Kinesis stream registered in the Glue Data Catalog as a streaming DataFrame
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database = "***",
    table_name = "***",
    transformation_ctx = "DataSource0",
    additional_options = {"startingPosition": "latest", "inferSchema": "false"}
)

print("Start")


job.commit()

And here is the error I get:

I went to the site that lists the Spark libraries, but I don't really know which one is missing or how to add it to a notebook.

Wai Ha Lee

1 Answer


I had this same problem, and after much application of head-to-wall, managed to come up with the following which seems to solve this:

The short answer:

  1. Download the jar for com.qubole.spark:spark-sql-kinesis_2.11 (at least version 1.2.0_spark-2.4 seems to work for me)
  2. Place the jar in an S3 bucket that both you and your Glue development endpoint's IAM role have access to
  3. When deploying the Glue development endpoint, set the "dependent jars path" to point at the jar you put in the S3 bucket (note: the S3 object itself, not the directory!); see the sketch after this list
  4. Make your notebook point at the development endpoint configured this way
  5. Your code should now work, at least to the point of no longer producing the getDataFrame error about Kinesis data sources. I have not tested it beyond that.
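
For reference, here is a rough sketch of what steps 2 and 3 can look like with boto3 (they can equally be done from the Glue console). The bucket, key, role ARN and endpoint name are placeholders of my own, and the Glue version / node count are just examples:

import boto3

# Placeholders; substitute your own bucket, paths, role and endpoint name.
JAR_LOCAL_PATH = "spark-sql-kinesis_2.11-1.2.0_spark-2.4.jar"
JAR_BUCKET = "my-glue-deps-bucket"
JAR_KEY = "jars/spark-sql-kinesis_2.11-1.2.0_spark-2.4.jar"

# Step 2: upload the connector jar to S3
s3 = boto3.client("s3")
s3.upload_file(JAR_LOCAL_PATH, JAR_BUCKET, JAR_KEY)

# Step 3: create the development endpoint with the jar as a dependent jar.
# ExtraJarsS3Path must point at the S3 object itself, not at a prefix/directory.
glue = boto3.client("glue")
glue.create_dev_endpoint(
    EndpointName="kinesis-dev-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/MyGlueDevEndpointRole",
    ExtraJarsS3Path="s3://{}/{}".format(JAR_BUCKET, JAR_KEY),
    GlueVersion="1.0",                      # Spark 2.4 runtime
    Arguments={"GLUE_PYTHON_VERSION": "3"},
    NumberOfNodes=2,
)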

A longer explanation + caveat:

This answer to a more general question about the missing data source error pointed me towards the dependency that seems to resolve the issue with glueContext.create_data_frame.from_catalog using a Kinesis source. I don't know much about what the dependency does (or how it differs from, e.g., the functionality of the equivalent streaming Glue jobs) beyond its GitHub page describing it as an implementation of a Kinesis connector.
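
Once the source resolves, the streaming DataFrame returned by create_data_frame.from_catalog can be consumed like any other Structured Streaming source. Here is a minimal sketch using plain foreachBatch (available since Spark 2.4); the checkpoint path and the per-batch logic are placeholders of my own, and Glue's own glueContext.forEachBatch helper may be preferable if your awsglue version has it:

def process_batch(batch_df, batch_id):
    # Placeholder per-micro-batch processing; here just a row count.
    print("batch", batch_id, "rows:", batch_df.count())

query = (
    data_frame_DataSource0.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "s3://my-bucket/checkpoints/kinesis-demo/")  # placeholder path
    .start()
)
query.awaitTermination()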

There also seems to be an oddity with the SageMaker notebook instances: listing the jars in use by the Spark context (as described here) produces an empty list, regardless of which jars you point the dev endpoint at. However, connecting via SSH to the endpoint's Python REPL (check the endpoint details for how to do that) produces the following for me:

>>> print(sc._jsc.sc().listJars())
ArrayBuffer(spark://foo.compute.internal:38875/jars/glue-assembly.jar, spark://foo.compute.internal:38875/jars/java-deps.jar, spark://foo.compute.internal:38875/jars/sagemaker-spark_2.11-spark_2.4.0-1.2.1.jar)

In that listing, java-deps.jar is the spark-sql-kinesis jar that I put in my S3 bucket. This might help you verify that your endpoint actually loaded the dependency, even though the notebook claims no additional jars are loaded. Regardless, the notebook instance does seem to make use of the jars listed via the REPL.
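
Another way to check from the notebook side is to simply re-run the catalog read and see whether the Kinesis source now resolves instead of throwing the dependency error; a minimal sketch, reusing the placeholders from the question:

# If the connector jar was picked up, this should no longer fail with the
# missing Kinesis data source error.
df = glueContext.create_data_frame.from_catalog(
    database = "***",
    table_name = "***",
    transformation_ctx = "DataSource0",
    additional_options = {"startingPosition": "latest", "inferSchema": "false"}
)
print(df.isStreaming)   # expect True for a streaming source
df.printSchema()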

I'm not fully happy with the solution I came up with for this problem (because I don't like adding dependencies on strange jars), but I'm still listing it here for posterity until someone comes up with a better one.

Aleksi