
Often, data is available with a folder structure like,

2000-01-01/john/smith

rather than the Hive partition spec,

date=2000-01-01/first_name=john/last_name=smith

Spark (and pyspark) can read partitioned data easily when it follows the Hive folder structure, but with the "bad" folder structure it becomes difficult and involves regexes and other path manipulation.

Is there an easier way to deal with a non-Hive folder structure for partitioned data in Spark?
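
For context, this is the "easy" case: with the Hive layout, a plain read picks up the partition columns automatically. A minimal sketch, assuming Parquet files under a hypothetical `s3://bucket/data` root:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partition discovery parses date=.../first_name=.../last_name=...
# directory names into real columns; no path handling is needed.
df = spark.read.parquet("s3://bucket/data")
df.printSchema()  # schema includes date, first_name and last_name
```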

  • It may make sense to do a one-time maintenance pass to refactor the folder structure. I don't see how Spark can get around a bad data structure. – Salim Feb 13 '20 at 19:18
  • AFAIK at the moment Spark doesn't provide any real optimizations for partitioned data, so as long as you understand the semantics you can just take [`input_file_name`](https://stackoverflow.com/q/39868263/10465355) and split the path into fields (see the sketch after these comments). – 10465355 Feb 13 '20 at 19:21

0 Answers