Using PySpark ORC path with a white space

Question

I am having an issue with a line of code that used to work fine in Spark 1.6 and doesn’t work in Spark 2.2. The error is java.io.FileNotFoundException: File does not exist:

Note there is a white space in the file path. The space is after the yyyy-mm-dd.

hdfs://hadoop/path/part_date=2018-04-20 15%3A01%3A21/000000_0

That might be causing the problem. How can I get around this?

df = spark.read.format('orc').load('hdfs://hadoop/path/part_date=2018-04-20%2015%253A01%253A21/000000_0')
df.show()

Have you tried `.load("hdfs://hadoop/path/part_date=2018-04-20 15%3A01%3A21/000000_0")` (with the space instead of `%20`)? Or perhaps escaping it? `.load("hdfs://hadoop/path/part_date=2018-04-20\ 15%3A01%3A21/000000_0")`? — pault, Apr 24 '18 at 19:32
@pault Tried both (with the space, and with the escaped space). On both occasions I got the same error: "java.io.FileNotFoundException: File does not exist: hdfs://hadoop/path/part_date=2018-04-20%2015%253A01%253A21/000000_0". It is adding the %20 encoding by itself, isn't it? — Tronald Dump, Apr 24 '18 at 19:46
According to [this post](https://stackoverflow.com/questions/11565694/java-io-filenotfoundexception-on-an-existing-file) using spaces should work. I'm not sure why it's not. Shouldn't make a difference, but maybe try using a raw string with the space? For example: `.load(r"hdfs://hadoop/path/part_date=2018-04-20 15%3A01%3A21/000000_0")` — pault, Apr 24 '18 at 19:55
@pault That is not making any difference. The Spark upgrade from 1.6 to 2.2 must have missed something. — Tronald Dump, Apr 24 '18 at 20:36
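
To illustrate the encoding asked about in the comments (this is not a fix), here is a minimal sketch using only Python's standard library. It checks that the path in the error message is what you get by percent-encoding the on-disk path one more time: the space becomes `%20` and each literal `%` becomes `%25`, so `%3A` turns into `%253A`. The variable names and the `safe` character set are assumptions chosen for this illustration. The snippet only shows how the two strings relate. It does not establish which part of Spark 2.2 performs the encoding.

```python
from urllib.parse import quote, unquote

# Path as it exists on HDFS, copied from the question.
actual_path = "hdfs://hadoop/path/part_date=2018-04-20 15%3A01%3A21/000000_0"

# Path reported in the FileNotFoundException, copied from the comment thread.
reported_path = "hdfs://hadoop/path/part_date=2018-04-20%2015%253A01%253A21/000000_0"

# Percent-encode the actual path once, leaving ':', '/' and '=' untouched
# (assumed safe set, picked so that only the space and '%' are rewritten).
encoded_once = quote(actual_path, safe=":/=")

# One round of encoding reproduces the path from the error message.
print(encoded_once == reported_path)

# Decoding the reported path once gives back the real directory name.
print(unquote(reported_path) == actual_path)
```

If both comparisons hold, the path being looked up has been encoded once more than the directory name that is actually stored.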

0 Answers