Convert folders structure to partitions on S3 using Spark

Question

I have a lot of data on S3 which are in folder instead of partitions. The structure looks like this:

## s3://bucket/countryname/year/weeknumber/a.csv

s3://Countries/Canada/2019/20/part-1.csv
s3://Countries/Canada/2019/20/part-2.csv
s3://Countries/Canada/2019/20/part-3.csv

s3://Countries/Canada/2019/21/part-1.csv
s3://Countries/Canada/2019/21/part-2.csv

Is there any way to convert that data as parititons. Something like this:

s3://Countries/Country=Canada/Year=2019/Week=20/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-2.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-3.csv

s3://Countries/Country=Canada/Year=2019/Week=21/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=21/part-2.csv

I have no clue how to do this, instead of having a for loop which iterates over all the folders and load the data, which is complex.

Any help will be appreciated.

Check This. https://stackoverflow.com/questions/16897778/can-i-point-multiple-location-to-same-hive-external-table — Lamanus, Jul 31 '19 at 09:38
Your data is manually partitioned with week number and you just need to set it as the partition. — Lamanus, Jul 31 '19 at 09:42
by running crawler over Countries, it will create partitions, partition_0, partition_1, partition_2. You can rename them either through Glue console or through glue pyspark code. — Sandeep Fatangare, Aug 13 '19 at 20:59

score 1 · Accepted Answer · answered Aug 01 '19 at 12:13

Hive style paths isn't always necessary for partitioning. I got to this question from another question you wrote in the context of Athena, so I'm going to guess that the underlying metastore is in fact Glue, and that you're really targeting Athena (I added the amazon-athena tag to your question).

In Presto, or Athena/Glue you can add partitions with for any kind of path, as long as the prefixes don't overlap. For example, you to add the partitions in your first example you would do this:

ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year_week = '2019-20') LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year_week = '2019-21') LOCATION 's3://Countries/Canada/2019/21/'

This assumes there is a year_week column, but you could have year and week as separate columns if you want (and do (country = 'Canada', year = '2019', week = '20')), either works.

Why are almost all Athena examples using Hive style paths (e.g. country=Canada/year=2019/week=20/part-1.csv)? Part of it is for historical reasons, IIRC Hive doesn't support any other scheme, partitioning and paths are tightly coupled. Another reason is that the Athena/Presto command MSCK REPAIR TABLE works only with that style of partitioning (but you want to avoid relying on that command anyway). There are also other tools that assume, or work with that style and no other. If you aren't using those, then it doesn't matter.

If you absolutely must use Hive style partitioning, there is a feature that lets you create "symlinks" to files in a separate path structure. You can find instructions on how to do it here: https://stackoverflow.com/a/55069330/1109 – but keep in mind that this means that you'll have to keep that other path structure up to date. If you don't have to use Hive style paths for your partitions, I would advice that you don't bother with the added complexity.

Thank you again for answering this. Just quick quetion, wouldn't it be complex to generate that query for all the countries and months and days? I guess I need to use boto or some other library to read the keys and generate the query using that. Am I correct? — Waqar Ahmed, Aug 01 '19 at 12:22
I usually write a script that scans S3 (or use S3 Inventory to get a listing) and generates the `ALTER TABLE` SQL. You can use [`list_objects_v2`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2) and the `Delimiter` and `Prefix` options to recursively list the structure, but avoid listing every single object (this is in fact what Athena does when you query an unpartitioned table). — Theo, Aug 01 '19 at 12:26
Great! Thank you for the hint. I will try that and in case of no questions anymore. I will accept it :) — Waqar Ahmed, Aug 01 '19 at 12:27

Convert folders structure to partitions on S3 using Spark

1 Answers1

Linked