
Often, data is available with a folder structure like,

2000-01-01/john/smith

rather than the Hive partition spec,

date=2000-01-01/first_name=john/last_name=smith

Spark (and pyspark) can read partitioned data easily when it follows the Hive folder structure, but with the "bad" folder structure it becomes difficult and involves regexes and other path manipulation.

Is there an easier way to deal with a non-Hive folder structure for partitioned data in Spark?
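
For context, this is the "easy" case: with the Hive layout, a plain read picks up the partition columns automatically. A minimal sketch, assuming Parquet files under a hypothetical `s3://bucket/data` root:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partition discovery parses date=.../first_name=.../last_name=...
# directory names into real columns; no path handling is needed.
df = spark.read.parquet("s3://bucket/data")
df.printSchema()  # schema includes date, first_name and last_name
```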

  • It may make sense to do a one-time maintenance pass to refactor the folder structure. I don't see how Spark can get around a bad data structure. – Salim Feb 13 '20 at 19:18
  • AFAIK at the moment Spark doesn't provide any real optimizations for partitioned data, so as long as you understand the semantics you can just take [`input_file_name`](https://stackoverflow.com/q/39868263/10465355) and split the path into fields (see the sketch after these comments). – 10465355 Feb 13 '20 at 19:21

0 Answers