
I am running a Spark job on a Google Dataproc cluster (image version 1.4, Spark 2.4.5) that reads a file from a GCS bucket using a regular expression in the path, and I am getting the error below.

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: gs://<gs_path>/<file_name>_\d*.dat;
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:552)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)

I am able to run the same job on a Dataproc 1.2 cluster with Spark 2.2.3, and it reads the file from the same path without issue.

Have there been any changes to the way regular expressions in paths should be formed in Spark 2.4.5, or any changes in the Google APIs used by Dataproc 1.4 clusters, that would require me to change the way I build these paths?
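As background, Spark resolves path patterns through Hadoop's glob filter, which understands `*`, `?`, and character classes like `[0-9]`, not Java regex tokens such as `\d`. The sketch below uses Python's `fnmatch` purely to illustrate those glob semantics (the `part_` prefix is a hypothetical stand-in for the real file name):

```python
from fnmatch import fnmatch

# Hadoop path globbing understands *, ?, [abc], and [a-b] -- not regex \d.
# "part_" is a hypothetical file prefix standing in for <file_name>.
regex_style = "part_\\d*.dat"    # what the question used; \d has no glob meaning
glob_style = "part_[0-9]*.dat"   # glob equivalent: one digit, then anything

print(fnmatch("part_20200810.dat", glob_style))   # True
print(fnmatch("part_20200810.dat", regex_style))  # False
```

With a glob-style pattern, the read would look something like `spark.read.text("gs://<gs_path>/<file_name>_[0-9]*.dat")`, though as the accepted answer below shows, the failure here turned out to be in the GCS connector's glob handling rather than the pattern itself.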

Ananth
  • Could you share your code so I can investigate further? – Alexandre Moraes Aug 10 '20 at 08:36
  • After working with Google support, I was asked to disable the flat glob algorithm in the GCS connector by setting these Hadoop properties during cluster creation: core:fs.gs.glob.flatlist.enable=false and core:fs.gs.glob.concurrent.enable=false. We also upgraded the GCS_CONNECTOR_VERSION from 1.9.17 to 1.9.18. The issue was resolved after setting those properties when creating the Dataproc cluster. – Ananth Aug 12 '20 at 21:32

1 Answer


The issue was resolved by disabling the flat glob algorithm in the GCS connector, which is done by setting these Hadoop properties during cluster creation:

core:fs.gs.glob.flatlist.enable=false

core:fs.gs.glob.concurrent.enable=false

We also upgraded the GCS_CONNECTOR_VERSION from 1.9.17 to 1.9.18.
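Applied at cluster creation time, this looks roughly like the following (the cluster name and region are placeholders; the exact mechanism for pinning the GCS connector version depends on the initialization action in use, so it is not shown here):

```shell
# Disable the flat and concurrent glob algorithms in the GCS connector
# via cluster properties. "my-cluster" and "us-central1" are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=1.4 \
    --properties='core:fs.gs.glob.flatlist.enable=false,core:fs.gs.glob.concurrent.enable=false'
```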

Ananth