0

Spark Version : 3.3.0 pyspark Version: 3.1.1 python Version: 3.7.9

I am trying to work with the functionality of pyspark.pandas. I created a pyspark.pandas dataframe and converted it into spark data frame using df.to_spark() function. After that I registered the df as a temp table using the functionality spark_df.registerTempTable("Table_Name").

Finally, I want to delete the data on basis of a few columns of this temp table.

This part gives me the error that non-determisitic expressions are not allowed.

from pyspark import pandas as ps
from spark_sdk.sql import SparkSql
sql = SparkSql()

data = [['tom', '10'], ['nick', '15'], ['juli', '14']]

# Create the pyspark-pandas DataFrame
df = ps.DataFrame(data, columns=['Name', 'Age'])
pandas_df = df.to_spark()
s_df = pandas_df

#Registering a temp table of DF
s_df.registerTempTable("temp_table")
temp = sql.execute_query(query="select * from temp_table")
print(temp.show())

#Creating a table x 
create_query = f"create table x(Name String, Age String)"
sql.execute_query(query=create_query)

#Final Delete query that fails
delete_query = f"delete from x as t1 WHERE t1.name in (SELECT distinct Name from temp_table)"
sql.execute_query(query=delete_query)

The error I get for this is:-

INFO:PSG_LOGGER:Executing Spark Query .. 
    POOL : default 
    QUERY : delete from p.x as t1 WHERE t1.name in (SELECT distinct Name from temp_table)
Traceback (most recent call last):
  File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/notebooks/_version.py", line 40, in <module>
    sql.execute_query(query=delete_query)
  File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/spark_sdk/utils/spark/sql.py", line 47, in execute_query
    all_dfs = [self._execute_single_query(query) for query in queries]
  File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/spark_sdk/utils/spark/sql.py", line 47, in <listcomp>
    all_dfs = [self._execute_single_query(query) for query in queries]
  File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/spark_sdk/utils/spark/sql.py", line 36, in _execute_single_query
    return spark.sql(query)
  File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/pyspark/sql/session.py", line 1034, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self)
  File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1322, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/puneetjain/Desktop/repos/psg-notebooks-sdk/venv/lib/python3.7/site-packages/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate or Window, found:
 (t1.name IN (listquery()))
in operator DeleteFromTable name#29 IN (list#28 [])
               ;
DeleteFromTable name#29 IN (list#28 [])
:  +- Distinct
:     +- Project [Name#1]
:        +- SubqueryAlias temp_table
:           +- View (`temp_table`, [Name#1,Age#2])
:              +- Project [Name#1, Age#2]
:                 +- Project [__index_level_0__#0L, Name#1, Age#2, monotonically_increasing_id() AS __natural_order__#6L]
:                    +- LogicalRDD [__index_level_0__#0L, Name#1, Age#2], false
+- SubqueryAlias t1
   +- SubqueryAlias spark_catalog.p.x
      +- HiveTableRelation [`p`.`x`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [Name#29, Age#30], Partition Cols: []]

If I use a normal pandas dataframe instead of pyspark-pandas, and use spark.createDataframe(pandas_df) to create spark dataframe, the code works fine.

Can someone please let me know what is causing the issue here and any way I can resolve this.

Thanks In Advance.

  • is the query using `EXISTS` correct? [refer](https://www.w3schools.com/sql/sql_exists.asp) – samkart Jul 30 '22 at 18:44
  • Yes, It uses exists. the final query that's giving error is : `delete from x as t1 WHERE EXISTS (SELECT distinct Name from temp_table)`. – Puneet Jain Jul 31 '22 at 06:49
  • shouldn't you be checking the names against some column in the x table? I'm guessing what you mean to do is `where in ( select distinct name from temp_table)`. Or, `where exists (select distinct name from temp where x.=temp.name)` – samkart Jul 31 '22 at 08:05
  • Sorry, my bad, you are correct I should be checking names against a column value. The final query should be `delete from x as t1 WHERE t1.name in (SELECT distinct Name from temp_table)`. I tried running code with this query. I have the same error. – Puneet Jain Jul 31 '22 at 08:41
  • using `in` is giving you an exists() error? Ease update the error traceback using the new query – samkart Jul 31 '22 at 09:20
  • Hey I have updated the traceback @samkart – Puneet Jain Aug 01 '22 at 05:42

0 Answers0