Pyspark in Azure - need to configure sparkContext

Question

Using spark Notebook in Azure Synapse, I'm processing some data from parquet files, and outputting it as different parquet files. I produced a working script and started applying it to different datasets, all working fine until I cam across a dataset containing dates older than 1900.

For this issue, I came across this article (which I took to be applicable to my scenario): Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0

The fix is to add this code chunk, which I did, to the top of my notebook:

%%pyspark
from pyspark import SparkContext
sc = SparkContext()
# Get current sparkconf which is set by glue
conf = sc.getConf()
# add additional spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart spark context
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# create glue context with the restarted sc
glueContext = GlueContext(sc)

Unfortunately this generated another error:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.IllegalStateException: Promise already completed. at scala.concurrent.Promise.complete(Promise.scala:53) at scala.concurrent.Promise.complete$(Promise.scala:52) at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187) at scala.concurrent.Promise.success(Promise.scala:86) at scala.concurrent.Promise.success$(Promise.scala:86) at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$sparkContextInitialized(ApplicationMaster.scala:408) at org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:910) at org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32) at org.apache.spark.SparkContext.(SparkContext.scala:683) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:238) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748)

I've tried looking into resolutions, but this is getting outside of my area of expertise. I want my Synapse spark notebook to run, even on date fields where the date is less than 1900. Any ideas?

score 1 · Answer 1 · answered Jan 17 '23 at 01:26

I was able to solve this problem by changing the overall configuration for my spark pool (which you will probably want to do as well, unless you want to add config code to every notebook you make). To do this, open up Synapse Studio, then go Manage > Apache Spark pools, click the three dots by your pool (which will be hidden until you mouse over them, great design Microsoft), then select Apache Spark configuration.

From there, create a new configuration, and add a configuration property. For the property, enter spark.sql.parquet.int96RebaseModeInRead and the value enter CORRECTED. Note that spark.sql.parquet.int96RebaseModeInRead does NOT show up as a suggested property, you have to enter it yourself.

Apply your changes, save everything, and make sure your new configuration is selected. It might take a bit for the new changes to be reflected in your notebooks, but it should work from there. If you notice some funky date issues with older dates, try changing CORRECTED to LEGACY.

Pyspark in Azure - need to configure sparkContext

1 Answers1