I am trying to understand Kafka + PySpark better, starting off with a test message that I would like to append to a Spark DataFrame. I can stream data from Kafka and read data from CSVs, but whenever I call the createDataFrame method I get the following error: `TypeError: 'JavaPackage' object is not callable`.
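For context, reads like these work fine in the same session (using the `spark` session created in the code further down; the bootstrap server, topic name, and CSV path below are just placeholders, not my real ones):

```python
# Reading a CSV works as expected (path is a placeholder)
df_csv = spark.read.csv("/path/to/sample.csv", header=True, inferSchema=True)
df_csv.show()

# Streaming from Kafka also works (bootstrap server and topic are placeholders)
df_kafka = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test-topic")
    .load()
)
```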
PySpark version (`pyspark --version`):

```
(Spark ASCII banner) version 3.3.1
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 17.0.6
```
Python Version: 3.11.3
Code:

```python
# PYSPARK CREATE DATAFRAME FROM dictionary
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, udf
import findspark
findspark.init()

# Config
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("PySparkTest") \
    .getOrCreate()

# Creating the test dictionary I want to append to a spark dataframe
test_message = {'user_id': 19, 'recipient_id': 57, 'message': 'YbfyRHyWgjuGlzOiudEcVMLJNzqUPDvV'}

# Put it into a pandas dataframe
df_pandas = pd.DataFrame([test_message])
df_pandas

## Create a spark schema / column headers
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("recipient_id", IntegerType(), True),
    StructField("message", StringType(), True)
])

df_spark = spark.createDataFrame(df_pandas, schema)
df_spark.show()

# ## ALTERNATIVE WAY: Create DataFrame from a single row
# ### PARSING THE JSON COMING OUT
# user_id = deserialized_cons.get('user_id')
# recipient_id = deserialized_cons.get('recipient_id')
# message = deserialized_cons.get('message')
# ## turn into row format and create and upload dataframe
# data = [(user_id, recipient_id, message)]
# df_spark = spark.createDataFrame(df_pandas, schema)
# df_spark.show()
```
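Written out, the commented-out alternative above looks roughly like this (`deserialized_cons` just stands in for the dict coming off my Kafka consumer); it fails with the same error:

```python
# deserialized_cons stands in for the dict read off the Kafka consumer
deserialized_cons = {'user_id': 19, 'recipient_id': 57,
                     'message': 'YbfyRHyWgjuGlzOiudEcVMLJNzqUPDvV'}

user_id = deserialized_cons.get('user_id')
recipient_id = deserialized_cons.get('recipient_id')
message = deserialized_cons.get('message')

# Turn the values into a single row and create the DataFrame from a list of tuples
data = [(user_id, recipient_id, message)]
df_spark = spark.createDataFrame(data, schema)
df_spark.show()
```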
Error output:

```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[25], line 44
42 ## turn into row format and create and upload dataframe
43 data = [(user_id, recipient_id, message)]
---> 44 df_spark = spark.createDataFrame(df_pandas,schema)
45 df_spark.show()
File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/sql/session.py:1273, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
1269 data = pd.DataFrame(data, columns=column_names)
1271 if has_pandas and isinstance(data, pd.DataFrame):
1272 # Create a DataFrame from pandas DataFrame.
-> 1273 return super(SparkSession, self).createDataFrame( # type: ignore[call-overload]
1274 data, schema, samplingRatio, verifySchema
1275 )
1276 return self._create_dataframe(
1277 data, schema, samplingRatio, verifySchema # type: ignore[arg-type]
1278 )
File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/sql/pandas/conversion.py:440, in SparkConversionMixin.createDataFrame(self, data, schema, samplingRatio, verifySchema)
438 raise
439 converted_data = self._convert_from_pandas(data, schema, timezone)
--> 440 return self._create_dataframe(converted_data, schema, samplingRatio, verifySchema)
File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/sql/session.py:1320, in SparkSession._create_dataframe(self, data, schema, samplingRatio, verifySchema)
1318 rdd, struct = self._createFromLocal(map(prepare, data), schema)
1319 assert self._jvm is not None
-> 1320 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
1321 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), struct.json())
1322 df = DataFrame(jdf, self)
File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/rdd.py:4897, in RDD._to_java_object_rdd(self)
4894 rdd = self._pickled()
4895 assert self.ctx._jvm is not None
-> 4897 return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/rdd.py:5441, in PipelinedRDD._jrdd(self)
5438 else:
5439 profiler = None
-> 5441 wrapped_func = _wrap_function(
5442 self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
5443 )
5445 assert self.ctx._jvm is not None
5446 python_rdd = self.ctx._jvm.PythonRDD(
5447 self._prev_jrdd.rdd(), wrapped_func, self.preservesPartitioning, self.is_barrier
5448 )
File ~/opt/anaconda3/envs/spark/lib/python3.11/site-packages/pyspark/rdd.py:5243, in _wrap_function(sc, func, deserializer, serializer, profiler)
5241 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
5242 assert sc._jvm is not None
-> 5243 return sc._jvm.SimplePythonFunction(
5244 bytearray(pickled_command),
5245 env,
5246 includes,
5247 sc.pythonExec,
5248 sc.pythonVer,
5249 broadcast_vars,
5250 sc._javaAccumulator,
5251 )
TypeError: 'JavaPackage' object is not callable
```
**Checking the Spark version:**

```
$ spark-submit — version
23/05/27 12:19:59 WARN Utils: Your hostname, Nics-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.4.24 instead (on interface en0)
23/05/27 12:19:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/05/27 12:19:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/Users/nicburkett/— does not exist'. Please specify one with --class.
```
I tried creating the Spark DataFrame both from a tuple and from a pandas DataFrame, and both gave the same JavaPackage error. Other Spark commands such as `spark.read` work fine, and I tried to troubleshoot a mismatched Spark/Java version, to no avail.

Am I maybe missing another JAR file? But this seems like such a basic function.
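For what it's worth, this is roughly how I have been checking which Spark installation the notebook kernel actually picks up (assuming findspark is the right tool for that):

```python
import os
import findspark
import pyspark

print(pyspark.__version__)               # version of the pip-installed pyspark package
print(findspark.find())                  # the Spark home that findspark resolves
print(os.environ.get("SPARK_HOME"))      # SPARK_HOME, if set in the environment
```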