Spark version: 3.3.0, PySpark version: 3.1.1, Python version: 3.7.9
I am trying to work with the pyspark.pandas API. I created a pyspark.pandas DataFrame and converted it into a Spark DataFrame using df.to_spark(). I then registered that DataFrame as a temp table via spark_df.registerTempTable("Table_Name").
Finally, I want to delete rows from another table based on a few columns of this temp table.
This last step fails with an error saying that non-deterministic expressions are not allowed.
from pyspark import pandas as ps
from spark_sdk.sql import SparkSql

sql = SparkSql()

data = [['tom', '10'], ['nick', '15'], ['juli', '14']]

# Create the pandas-on-Spark DataFrame and convert it to a Spark DataFrame
df = ps.DataFrame(data, columns=['Name', 'Age'])
spark_df = df.to_spark()

# Register a temp table for the DataFrame
spark_df.registerTempTable("temp_table")
temp = sql.execute_query(query="select * from temp_table")
temp.show()

# Create a table x
create_query = "create table x(Name String, Age String)"
sql.execute_query(query=create_query)

# Final delete query that fails
delete_query = "delete from x as t1 WHERE t1.name in (SELECT distinct Name from temp_table)"
sql.execute_query(query=delete_query)
The error I get is:
INFO:PSG_LOGGER:Executing Spark Query ..
POOL : default
QUERY : delete from p.x as t1 WHERE t1.name in (SELECT distinct Name from temp_table)
Traceback (most recent call last):
File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/notebooks/_version.py", line 40, in <module>
sql.execute_query(query=delete_query)
File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/spark_sdk/utils/spark/sql.py", line 47, in execute_query
all_dfs = [self._execute_single_query(query) for query in queries]
File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/spark_sdk/utils/spark/sql.py", line 47, in <listcomp>
all_dfs = [self._execute_single_query(query) for query in queries]
File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/spark_sdk/utils/spark/sql.py", line 36, in _execute_single_query
return spark.sql(query)
File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/pyspark/sql/session.py", line 1034, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self)
File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1322, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/pyspark/sql/utils.py", line 196, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate or Window, found:
(t1.name IN (listquery()))
in operator DeleteFromTable name#29 IN (list#28 [])
;
DeleteFromTable name#29 IN (list#28 [])
: +- Distinct
: +- Project [Name#1]
: +- SubqueryAlias temp_table
: +- View (`temp_table`, [Name#1,Age#2])
: +- Project [Name#1, Age#2]
: +- Project [__index_level_0__#0L, Name#1, Age#2, monotonically_increasing_id() AS __natural_order__#6L]
: +- LogicalRDD [__index_level_0__#0L, Name#1, Age#2], false
+- SubqueryAlias t1
+- SubqueryAlias spark_catalog.p.x
+- HiveTableRelation [`p`.`x`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [Name#29, Age#30], Partition Cols: []]
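Looking at the analyzed plan above, the temp view built from to_spark() carries a hidden __natural_order__ column computed with monotonically_increasing_id(), which is a non-deterministic expression, so I assume that is what the DeleteFromTable operator is rejecting. A minimal way to see this (assuming direct access to the converted Spark DataFrame) is to print its plan:

# Sketch: inspect the plan of the DataFrame returned by to_spark().
# The analyzed plan should show "monotonically_increasing_id() AS __natural_order__",
# matching the non-deterministic expression reported in the error above.
spark_df.explain(extended=True)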
If I instead use a normal pandas DataFrame and create the Spark DataFrame with spark.createDataFrame(pandas_df), the code works fine.
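For reference, this is the working variant (a minimal sketch; it assumes an active SparkSession named spark and reuses data and delete_query from above):

import pandas as pd

# Same data, but via plain pandas instead of pyspark.pandas
pdf = pd.DataFrame(data, columns=['Name', 'Age'])
spark_df = spark.createDataFrame(pdf)  # assumes an active SparkSession `spark`
spark_df.registerTempTable("temp_table")

# The same delete query now runs without the non-deterministic-expression error
sql.execute_query(query=delete_query)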
Can someone please let me know what is causing the issue here, and how I can resolve it?
Thanks in advance.