
Environment(s):

  • Azure blob storage and Local File System
  • Scala 2.12.10/Spark 3.0.1

With a file existing at C:\path\to\any\file-with-[brackets].csv,

spark.read.csv("C:\\path\\to\\any\\file-with-[brackets].csv")

results in

org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:/path/to/any/file-with-[brackets].csv;
  at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:621)
...

If I remove the brackets both from the file and the path string, no problem:

spark.read.csv("C:\\path\\to\\any\\file-without-brackets.csv")

results in

csv: org.apache.spark.sql.DataFrame = [_c0: string]

(I cannot believe this hasn't been seen before, but I find no mention of it on Google or Stack Overflow.)

How do you reference a file with brackets in the filename in Spark?

UPDATE: I discovered that brackets are part of the glob syntax Spark uses for file paths. Wildcards are allowed in Spark file paths, and [1234] matches any single character 1, 2, 3, or 4 at that position in the path, like a regex character class. But I cannot figure out how to escape a bracket in this context.
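To illustrate the bracket semantics, here is a minimal sketch that uses the JDK's own glob matcher (java.nio) as a stand-in. This is an assumption for demonstration only: Spark goes through Hadoop's separate glob implementation, but the character-class meaning of [1234] is the same idea. The object name GlobDemo is hypothetical.

```scala
import java.nio.file.{FileSystems, Paths}

// Stand-in demo: the JDK glob matcher, NOT the Hadoop glob code Spark calls.
object GlobDemo {
  // "[1234]" is a character class: exactly one of the characters 1, 2, 3, 4.
  private val matcher =
    FileSystems.getDefault.getPathMatcher("glob:file-[1234].csv")

  // True when the given file name matches the glob pattern above.
  def matches(name: String): Boolean = matcher.matches(Paths.get(name))
}

// GlobDemo.matches("file-1.csv")       // matched by the character class
// GlobDemo.matches("file-[1234].csv")  // the literal-bracket name is not matched
```

This is why a path string containing literal brackets does not find the file of the same name: the brackets are consumed as pattern syntax.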

How do you escape Spark's wildcard handling of square brackets when referencing a single file with literal square brackets in its path, using Spark's DataFrameReader?

Rimer

1 Answer


A similar issue in Hadoop has been discussed here.

Spark uses Hadoop libraries to read files. If you look at the stack trace and check the line that throws the message, it points to the checkAndGlobPathIfNecessary method in DataSource.scala. Internally a getUri method is called, and the issue may be occurring there. We can escape [ with %5B and ] with %5D as discussed here, but that applies to URLs. I am not sure how to achieve this with a plain string, since the csv() and load() methods accept a String, not a URI.
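One possible workaround, which I have not verified on Spark 3.0.1, is to avoid the brackets entirely by replacing each one with the glob wildcard ?, which matches any single character. The sketch below is only a string helper under that assumption; the names BracketGlob and bracketsToWildcards are hypothetical and not part of Spark.

```scala
// Hypothetical helper: not a Spark or Hadoop API.
object BracketGlob {
  // Replace each literal '[' or ']' with '?', the glob wildcard for
  // "any single character", so the path no longer contains a character class.
  def bracketsToWildcards(path: String): String =
    path.map {
      case '[' | ']' => '?'
      case c         => c
    }
}

// Assumed usage in spark-shell (untested):
// val df = spark.read.csv(
//   BracketGlob.bracketsToWildcards("C:\\path\\to\\any\\file-with-[brackets].csv"))
```

Because ? matches any character, the resulting pattern could also pick up other files in the same directory whose names differ only at those two positions, so this is only safe when no such files exist.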

Mohana B C
  • It's because it's expecting regex notation inside the brackets to allow more precise globbing, so they aren't interpreted as part of the path. – Rimer Oct 22 '21 at 20:34