We have our data in a relational database, in a single table with columns productid, date, value1 and value2, like this:
productid  date        value1  value2
1          2005-10-26  24      27
1          2005-10-27  22      28
2          2005-10-26  12      18
We're trying to load this data to S3 as Parquet and create the metadata in Hive so we can query it with Athena and Redshift. Our most frequent queries will filter on product id, day, month and year, so we want to lay out the partitions in a way that gives better query performance.
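For context, something like the following is what I have in mind for the extract step. The JDBC URL, table name and credentials are placeholders, not our real connection details:

spark-shell:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("product-values-to-s3")
  .getOrCreate()

// Pull the source table over JDBC. URL, table name and credentials
// below are placeholders for our actual relational database.
val source = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "product_values")
  .option("user", "etl_user")
  .option("password", "***")
  .load()

// Schema of the sample rows above:
// productid: int, date: date, value1: int, value2: int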
From what I understand, I can create the partitions like this:
s3://my-bucket/my-dataset/dt=2017-07-01/
...
s3://my-bucket/my-dataset/dt=2017-07-09/
s3://my-bucket/my-dataset/dt=2017-07-10/
or like this:
s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/
- Which layout will be faster to query, given that I have 7 years of data?
- Also, how can I add partitioning on product id here, so that those queries are faster too?
- How can I create this key=value folder structure (s3://mybucket/year=2017/month=06/day=01/) using Spark Scala? Any examples? I've put a rough sketch of what I'm imagining below.
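To make the question concrete, here is the kind of write I have in mind, using the source DataFrame from the JDBC read above and the hypothetical bucket path from the layouts. I'm not sure this is the right approach, which is part of what I'm asking:

import org.apache.spark.sql.functions.{col, date_format}

// Derive zero-padded partition columns from the date column, then write
// Parquet partitioned as year=/month=/day=/productid=. On plain Apache
// Spark the path scheme would typically be s3a:// rather than s3://.
val withParts = source
  .withColumn("year",  date_format(col("date"), "yyyy"))
  .withColumn("month", date_format(col("date"), "MM"))
  .withColumn("day",   date_format(col("date"), "dd"))

withParts.write
  .mode("overwrite")
  .partitionBy("year", "month", "day", "productid")
  .parquet("s3://mybucket/my-dataset/")

// For the single-column layout instead, derive one dt column and use
// .partitionBy("dt"):
//   .withColumn("dt", date_format(col("date"), "yyyy-MM-dd"))

I understand I would still need to register the partitions in the Hive/Glue catalog afterwards (e.g. MSCK REPAIR TABLE from Athena) before they are queryable; I've left that out here.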