
I am running a Spark job on a Google Dataproc cluster (image version 1.4, Spark 2.4.5) that reads a file from a GCS bucket using a regular expression in the path, and I am getting the error below.

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: gs://<gs_path>/<file_name>_\d*.dat;
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:552)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)

I am able to run the same job on a Dataproc 1.2 cluster with Spark 2.2.3, and it reads the file from the same path without issue.

Have there been any changes to the way regular expressions in paths should be formed in Spark 2.4.5, or any changes in the Google APIs used by Dataproc 1.4 clusters, that would require me to change the way I build these paths?
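As background, Spark resolves path patterns through Hadoop's glob filter, which understands `*`, `?`, and character classes like `[0-9]`, not Java regex tokens such as `\d`. The sketch below uses Python's `fnmatch` purely to illustrate those glob semantics (the `part_` prefix is a hypothetical stand-in for the real file name):

```python
from fnmatch import fnmatch

# Hadoop path globbing understands *, ?, [abc], and [a-b] -- not regex \d.
# "part_" is a hypothetical file prefix standing in for <file_name>.
regex_style = "part_\\d*.dat"    # what the question used; \d has no glob meaning
glob_style = "part_[0-9]*.dat"   # glob equivalent: one digit, then anything

print(fnmatch("part_20200810.dat", glob_style))   # True
print(fnmatch("part_20200810.dat", regex_style))  # False
```

With a glob-style pattern, the read would look something like `spark.read.text("gs://<gs_path>/<file_name>_[0-9]*.dat")`, though as the accepted answer below shows, the failure here turned out to be in the GCS connector's glob handling rather than the pattern itself.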

Ananth
  • Could you share your code so I can investigate further? – Alexandre Moraes Aug 10 '20 at 08:36
  • After working with Google support, I was asked to disable the flat glob algorithm in the GCS connector by setting these Hadoop properties during cluster creation: core:fs.gs.glob.flatlist.enable=false and core:fs.gs.glob.concurrent.enable=false. We also upgraded the GCS_CONNECTOR_VERSION from 1.9.17 to 1.9.18. The issue was resolved after setting those properties when creating the Dataproc cluster. – Ananth Aug 12 '20 at 21:32

1 Answer


The issue was resolved by disabling the flat glob algorithm in the GCS connector, which is done by setting these Hadoop properties during cluster creation:

core:fs.gs.glob.flatlist.enable=false

core:fs.gs.glob.concurrent.enable=false

We also upgraded the GCS_CONNECTOR_VERSION from 1.9.17 to 1.9.18.
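Applied at cluster creation time, this looks roughly like the following (the cluster name and region are placeholders; the exact mechanism for pinning the GCS connector version depends on the initialization action in use, so it is not shown here):

```shell
# Disable the flat and concurrent glob algorithms in the GCS connector
# via cluster properties. "my-cluster" and "us-central1" are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=1.4 \
    --properties='core:fs.gs.glob.flatlist.enable=false,core:fs.gs.glob.concurrent.enable=false'
```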

Ananth