While trying to repartition a Delta Lake table partitioned by date (yyyy-MM-dd) and time (hhmm), I'm getting this error:

File "/usr/local/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 739, in save
    self._jwrite.save(path)
File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
    pyspark.sql.utils.AnalysisException: "Cannot recognize the predicate 'Column<b'((partitionTime = 1357) AND (partitionDate = 2020-10-27))'>';"

I'm able to filter on each partition column individually, but when I use both at once I get the error above:

import pyspark.sql.functions as sf

spark \
 .read.format("delta") \
 .load(table_path) \
 .where((sf.col("partitionTime") == "1357") & (sf.col("partitionDate") == "2020-10-27")) \
 .repartition(n_partitions) \
 .write \
 .option("dataChange", "false") \
 .format("delta") \
 .mode("overwrite") \
 .option("replaceWhere", ((sf.col("partitionTime") == "1357") & (sf.col("partitionDate") == "2020-10-27"))) \
 .save(table_path)

Wondering what could cause this issue! I did follow the documentation from delta.io.

Alex Ott
Vivek Jain

1 Answer

Try it like this:

.option("replaceWhere", "partitionTime = '1357' AND partitionDate = '2020-10-27'")

It looks like the replaceWhere option does not accept the PySpark Column expression that works for where. In the replaceWhere option you have to pass a SQL predicate string.
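If the partition values arrive as variables rather than literals, the SQL predicate string can be built with plain string formatting. This is a minimal sketch; the helper name `replace_where_predicate` and the parameter names are my own, not from the question:

```python
def replace_where_predicate(partition_date: str, partition_time: str) -> str:
    # replaceWhere expects a SQL predicate string, not a PySpark Column,
    # so the values are interpolated as quoted SQL literals.
    return (
        f"partitionDate = '{partition_date}' "
        f"AND partitionTime = '{partition_time}'"
    )
```

This string can then be passed directly, e.g. `.option("replaceWhere", replace_where_predicate("2020-10-27", "1357"))`, while the `.where()` call can keep using the Column syntax.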

elyptikus