
I am trying to understand Kafka + PySpark better, starting with a test message that I would like to load into a Spark DataFrame. I can stream data from Kafka and read data from CSVs, but for some reason I cannot use the createDataFrame method; I always get the following error: TypeError: 'JavaPackage' object is not callable
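For context, the Kafka streaming read works without issues on the same SparkSession configured in the code below. It looks roughly like this (the broker address and topic name are placeholders):

```python
# Kafka source that streams fine (placeholder broker address and topic name)
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test-topic") \
    .load()
```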

Pyspark Version (pyspark --version): 3.3.1

Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 17.0.6

Python Version: 3.11.3

Code:

# Create a PySpark DataFrame from a dictionary
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, udf
import findspark

findspark.init()

# Config
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("PySparkTest") \
    .getOrCreate()

# Create the test dictionary I want to load into a Spark DataFrame
test_message = {'user_id': 19, 'recipient_id': 57, 'message': 'YbfyRHyWgjuGlzOiudEcVMLJNzqUPDvV'}


# Put it into a pandas DataFrame
df_pandas = pd.DataFrame([test_message])
df_pandas


## Create the Spark schema / column headers
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("recipient_id", IntegerType(), True),
    StructField("message", StringType(), True)
])

df_spark = spark.createDataFrame(df_pandas, schema)
df_spark.show()


# ## ALTERNATIVE WAY: Create DataFrame from a single row
# ### PARSING THE JSON COMING OUT 
# user_id = deserialized_cons.get('user_id')
# recipient_id = deserialized_cons.get('recipient_id')
# message = deserialized_cons.get('message')
# ## turn into row format and create and upload dataframe
# data = [(user_id, recipient_id, message)]
# df_spark = spark.createDataFrame(data, schema)
# df_spark.show()

Error output:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[25], line 44
     42 ## turn into row format and create and upload dataframe
     43 data = [(user_id, recipient_id, message)]
---> 44 df_spark = spark.createDataFrame(df_pandas,schema)
     45 df_spark.show()

File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/sql/session.py:1273, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
   1269     data = pd.DataFrame(data, columns=column_names)
   1271 if has_pandas and isinstance(data, pd.DataFrame):
   1272     # Create a DataFrame from pandas DataFrame.
-> 1273     return super(SparkSession, self).createDataFrame(  # type: ignore[call-overload]
   1274         data, schema, samplingRatio, verifySchema
   1275     )
   1276 return self._create_dataframe(
   1277     data, schema, samplingRatio, verifySchema  # type: ignore[arg-type]
   1278 )

File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/sql/pandas/conversion.py:440, in SparkConversionMixin.createDataFrame(self, data, schema, samplingRatio, verifySchema)
    438             raise
    439 converted_data = self._convert_from_pandas(data, schema, timezone)
--> 440 return self._create_dataframe(converted_data, schema, samplingRatio, verifySchema)

File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/sql/session.py:1320, in SparkSession._create_dataframe(self, data, schema, samplingRatio, verifySchema)
   1318     rdd, struct = self._createFromLocal(map(prepare, data), schema)
   1319 assert self._jvm is not None
-> 1320 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
   1321 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), struct.json())
   1322 df = DataFrame(jdf, self)

File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/rdd.py:4897, in RDD._to_java_object_rdd(self)
   4894 rdd = self._pickled()
   4895 assert self.ctx._jvm is not None
-> 4897 return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)

File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/rdd.py:5441, in PipelinedRDD._jrdd(self)
   5438 else:
   5439     profiler = None
-> 5441 wrapped_func = _wrap_function(
   5442     self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
   5443 )
   5445 assert self.ctx._jvm is not None
   5446 python_rdd = self.ctx._jvm.PythonRDD(
   5447     self._prev_jrdd.rdd(), wrapped_func, self.preservesPartitioning, self.is_barrier
   5448 )

File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/rdd.py:5243, in _wrap_function(sc, func, deserializer, serializer, profiler)
   5241 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
   5242 assert sc._jvm is not None
-> 5243 return sc._jvm.SimplePythonFunction(
   5244     bytearray(pickled_command),
   5245     env,
   5246     includes,
   5247     sc.pythonExec,
   5248     sc.pythonVer,
   5249     broadcast_vars,
   5250     sc._javaAccumulator,
   5251 )

TypeError: 'JavaPackage' object is not callable

**Checking the Spark version:**

$ spark-submit — version
23/05/27 12:19:59 WARN Utils: Your hostname, Nics-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.4.24 instead (on interface en0)
23/05/27 12:19:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/05/27 12:19:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/Users/nicburkett/— does not exist'.  Please specify one with --class.
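(I just noticed the dash in that command got converted into an em dash, which is why spark-submit treats it as a missing application JAR rather than the --version flag; the command I meant to run was spark-submit --version.)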


I tried creating the Spark DataFrame from a tuple and from a pandas DataFrame; both give the same JavaPackage error.

Other Spark commands such as spark.read work fine, and I have tried troubleshooting a mismatched Spark/Java version, to no avail.
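For example, a plain CSV read on the same session works without errors (the path here is just a placeholder):

```python
# Reading a local CSV with the same SparkSession works fine (placeholder path)
csv_df = spark.read.csv("/path/to/sample.csv", header=True, inferSchema=True)
csv_df.show()
```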

I'm thinking I may need to add another JAR file, but this seems like such a basic function.
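If another JAR really is needed, I assume it would be added on the builder via spark.jars.packages, something like this sketch (the Kafka connector coordinates are only an example, pinned to my Spark/Scala versions):

```python
# Hypothetical: pulling in an extra package via spark.jars.packages
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("PySparkTest") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \
    .getOrCreate()
```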
