
We have our data in a relational database, in a single table with product id and date columns, like this:

productid   date        value1   value2
1           2005-10-26  24       27
1           2005-10-27  22       28
2           2005-10-26  12       18
We are trying to load this data to S3 as Parquet and create the metadata in Hive so we can query it with Athena and Redshift Spectrum. Our most frequent queries will filter on product id, day, month, and year, so we want to lay out the partitions in a way that gives the best query performance.

From what I understand, I can create the partitions either like this:

s3://my-bucket/my-dataset/dt=2017-07-01/   
...
s3://my-bucket/my-dataset/dt=2017-07-09/   
s3://my-bucket/my-dataset/dt=2017-07-10/

or like this:

s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/
  1. Which of these will be faster to query, given that I have 7 years of data?
  2. Also, how can I add partitioning by product id here, so that those lookups are faster too?
  3. How can I create these key=value folder structures (e.g. s3://mybucket/year=2017/month=06/day=01/) using Spark Scala? Any examples?
ds_user
  • What are you querying **by**? Years, or products, or a combination of both? How many unique products are you supporting? If it's a very long ID, then you should partition by a prefix of the ID. Hive performance starts to fall over around 10,000 partitions, from what I've read. 365*12*7 is far too many – OneCricketeer Mar 21 '18 at 03:27
  • Going to build an API over it, which will allow querying by either year, month, day, or product id. And it's not 365*12*7, it is 365*7, right? – ds_user Mar 21 '18 at 03:31
  • Oops, yeah. Sorry, 7*365... Anyway. The API shouldn't really matter. It's the Spark code that is going to be slow if you don't optimize the partitioning scheme – OneCricketeer Mar 21 '18 at 03:34
  • Yeah, that's why I'm trying to find a better partitioning scheme for this. Also, there are more than 20,000 products, which will make it worse if I include both product and day as partitions. – ds_user Mar 21 '18 at 03:36
  • If you partition by just the first digit of the product, then you can optimize a scan over the dataset by a factor of 10. If you are trying to optimize some type of database lookup, then RDS (or any database) might be a better option than an object store. – OneCricketeer Mar 21 '18 at 03:40
  • OK, that makes sense. But if I try to partition by the first digit of the product, I need to have that as a column in the dataset, right? – ds_user Mar 21 '18 at 03:41
  • Correct. You would have something like `productid_key,productId,date,value1,value2` – OneCricketeer Mar 21 '18 at 03:42
  • Cool. Can you point me to a sample snippet for writing Parquet partitions to S3 like this? – ds_user Mar 21 '18 at 03:44
  • You can `partitionBy("field1", "field2")` on a DataFrame... Then you use `.parquet("s3://path")` to write to disk – OneCricketeer Mar 21 '18 at 05:13
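Expanding that last comment into a runnable sketch: the snippet below builds the question's sample table inline (in practice it would come from the relational database via JDBC), derives the hypothetical `productid_key` prefix column from the discussion above plus year/month/day columns, and writes Hive-style `key=value` directories. The bucket path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
import spark.implicits._

// Inline stand-in for the question's table; in practice, read via JDBC.
val df = Seq(
  (1, "2005-10-26", 24, 27),
  (1, "2005-10-27", 22, 28),
  (2, "2005-10-26", 12, 18)
).toDF("productid", "date", "value1", "value2")
  .withColumn("date", to_date($"date"))

val withKeys = df
  // Hypothetical prefix column from the comments: first digit of the product id.
  .withColumn("productid_key", substring($"productid".cast("string"), 1, 1))
  // Derive the date-based partition columns.
  .withColumn("year", year($"date"))
  .withColumn("month", month($"date"))
  .withColumn("day", dayofmonth($"date"))

withKeys.write
  .partitionBy("productid_key", "year", "month", "day") // -> .../productid_key=1/year=2005/month=10/day=26/
  .mode("overwrite")
  .parquet("s3://mybucket/my-dataset/") // placeholder bucket
```

Two caveats: Spark writes the raw values (`month=6`, not `month=06`), so wrap the columns in `format_string("%02d", ...)` if you want zero-padded directory names; and because the directories follow Hive's `key=value` convention, Athena can pick the partitions up with `MSCK REPAIR TABLE` rather than manual `ADD PARTITION` statements.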

1 Answer


We partitioned like this:

s3://bucket/year/month/day/hour/minute/product/region/availabilityzone/

s3://bucketname/2018/03/01/11/30/nest/e1/e1a

The minute is rounded down to 30-minute buckets. If your traffic is high, you can use a higher resolution for the minutes; otherwise you can coarsen to the hour or even the day.
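As a rough Spark Scala sketch of deriving those partition columns (tying back to the question's third point): the `ts`, `product`, `region`, and `availabilityzone` column names below are assumptions for illustration, not from the answer, and note that `partitionBy` writes Hive-style `year=2018/month=03/...` directories rather than the bare `2018/03/...` paths shown above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("time-partitioned-write").getOrCreate()
import spark.implicits._

// Illustrative stand-in for the event data; the column names are assumptions.
val events = Seq(
  ("2018-03-01 11:42:10", "nest", "e1", "e1a")
).toDF("ts", "product", "region", "availabilityzone")
  .withColumn("ts", to_timestamp($"ts"))

val partitioned = events
  .withColumn("year", year($"ts"))
  .withColumn("month", format_string("%02d", month($"ts")))
  .withColumn("day", format_string("%02d", dayofmonth($"ts")))
  .withColumn("hour", format_string("%02d", hour($"ts")))
  // Round the minute down to a 30-minute bucket (00 or 30), as described above.
  .withColumn("minute", format_string("%02d", (minute($"ts") / 30).cast("int") * 30))

partitioned.write
  .partitionBy("year", "month", "day", "hour", "minute",
               "product", "region", "availabilityzone")
  .parquet("s3://bucketname/") // placeholder
```

If you want the bare `2018/03/01/...` layout instead, you have to write each path explicitly and register every partition yourself (e.g. `ALTER TABLE ... ADD PARTITION ... LOCATION ...`), which is what the comment thread below turns to.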

This helped a lot, depending on what data we want to query (using Athena or Redshift Spectrum) and over what time range.

Hope it helps.

Kannaiyan
  • Thanks for your answer. Can you please help with my 3rd question? A sample snippet would be helpful. – ds_user Mar 21 '18 at 03:09
  • But if I don't put year= in the path, we will have to add the partitions manually, as mentioned here: https://forums.aws.amazon.com/thread.jspa?threadID=257650&tstart=0 – ds_user Mar 21 '18 at 03:12
  • Hive and related tools require `key=value` partitions, last time I checked – OneCricketeer Mar 21 '18 at 03:19
  • https://stackoverflow.com/questions/36927918/using-spark-to-write-a-parquet-file-to-s3-over-s3a-is-very-slow. I will remove the = from the answer. We have been using this with datasets of several GB every day and it does not seem to affect speed. We did notice slow speeds with many partitions. – Kannaiyan Mar 21 '18 at 03:20
  • *We noticed slow speeds with many partitions* -- well, yes. If you don't need to query month or day explicitly, then using `dt=20180301` and `hour=1130` is far more performant. – OneCricketeer Mar 21 '18 at 03:29
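For what it's worth, a sketch of that coarser scheme, reusing the illustrative `events` DataFrame (with its timestamp column `ts`) from the snippet in the answer above; the `dt` and `hour` formats follow the comment:

```scala
import org.apache.spark.sql.functions._

// Coarser layout from the comment: one dt partition per day plus an hour
// partition per 30-minute bucket, e.g. .../dt=20180301/hour=1130/.
val coarse = events
  .withColumn("dt", date_format($"ts", "yyyyMMdd"))
  .withColumn("hour", concat(
    format_string("%02d", hour($"ts")),
    format_string("%02d", (minute($"ts") / 30).cast("int") * 30)))

coarse.write.partitionBy("dt", "hour").parquet("s3://bucketname/coarse/") // placeholder path
```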