Using spark Notebook in Azure Synapse, I'm processing some data from parquet files, and outputting it as different parquet files. I produced a working script and started applying it to different datasets, all working fine until I cam across a dataset containing dates older than 1900.
For this issue, I came across this article (which I took to be applicable to my scenario): Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
The fix is to add this code chunk, which I did, to the top of my notebook:
%%pyspark
from pyspark import SparkContext
sc = SparkContext()
# Get current sparkconf which is set by glue
conf = sc.getConf()
# add additional spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart spark context
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# create glue context with the restarted sc
glueContext = GlueContext(sc)
Unfortunately this generated another error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.IllegalStateException: Promise already completed. at scala.concurrent.Promise.complete(Promise.scala:53) at scala.concurrent.Promise.complete$(Promise.scala:52) at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187) at scala.concurrent.Promise.success(Promise.scala:86) at scala.concurrent.Promise.success$(Promise.scala:86) at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$sparkContextInitialized(ApplicationMaster.scala:408) at org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:910) at org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32) at org.apache.spark.SparkContext.(SparkContext.scala:683) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:238) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748)
I've tried looking into resolutions, but this is getting outside of my area of expertise. I want my Synapse spark notebook to run, even on date fields where the date is less than 1900. Any ideas?