How can I have multiple partitions based on different parameters for my data in a data lake?

Question

We're building a new Data Lake for a huge amount of data from various data sources, storing the data in Parquet format in Amazon S3 buckets.

We currently partition the data on a single field (e.g., Record-Creation-Time), so we're fine as long as our queries to the data lake filter on that field.

But now we need to query the same data on a few other fields as well (e.g., Last-Updated-Time, Transaction-Time, etc.). We're wondering how we can do this without duplicating the data and storing a separate copy partitioned on each of those other fields.
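To make the problem concrete, here is a minimal pure-Python sketch of what I mean. It is not our real Spark code: the records, the field names (`created`, `updated`), and the `created_date=` directory naming are all made up for illustration, and it only models directory-level partition pruning, not anything Parquet or the query engine might do inside a file.

```python
from datetime import date

# Hypothetical records; field names are illustrative only.
RECORDS = [
    {"id": 1, "created": date(2019, 5, 1), "updated": date(2019, 5, 3)},
    {"id": 2, "created": date(2019, 5, 1), "updated": date(2019, 5, 1)},
    {"id": 3, "created": date(2019, 5, 2), "updated": date(2019, 5, 3)},
    {"id": 4, "created": date(2019, 5, 3), "updated": date(2019, 5, 3)},
]


def partition_key(record):
    """Hive-style partition directory derived from the creation date only."""
    return f"created_date={record['created'].isoformat()}"


def build_layout(records):
    """Group records by partition directory, mimicking files under an S3 prefix."""
    layout = {}
    for r in records:
        layout.setdefault(partition_key(r), []).append(r)
    return layout


def query(layout, field, value):
    """Return (matching ids, number of partitions read).

    Only a filter on the partition field can skip directories; a filter on
    any other field has to read every partition.
    """
    if field == "created":
        wanted = f"created_date={value.isoformat()}"
        parts = {k: v for k, v in layout.items() if k == wanted}
    else:
        parts = layout
    ids = [r["id"] for rows in parts.values() for r in rows if r[field] == value]
    return ids, len(parts)


layout = build_layout(RECORDS)
by_created = query(layout, "created", date(2019, 5, 1))  # reads 1 of 3 partitions
by_updated = query(layout, "updated", date(2019, 5, 3))  # reads all 3 partitions
```

The filter on `created` touches one partition, while the filter on `updated` has to read all of them. That full scan is what we're trying to avoid without keeping a second copy of the data.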

I'm sure this is a common problem with established solutions, but I haven't been able to find much information so far. I'm hoping the data experts on Stack Overflow can suggest the right way to store the data in the data lake so that I can query it efficiently on several different fields.

I've looked through many Big Data blogs and sites, but haven't found anything specific to this question.

Have you implemented anything so far and measured performance? It might be worth doing a Proof of Concept to try the desired technology (be it Amazon Athena, Amazon Redshift Spectrum, or Apache Spark) and identify performance bottlenecks so you have a performance baseline. Depending on how you use the data, it could be worth loading it into transient Amazon Redshift clusters for improved performance while keeping the Data Lake as the definitive source of data. – John Rotenstein, May 11 '19 at 09:39
Thanks for your comment, @JohnRotenstein. We're using Apache Spark to read the data from the Data Lake and push it into a Data Warehouse. We already have this implemented. Our Spark application currently queries the data based on the **Record-Creation-Time**. This performs fine since the data is already partitioned on this field. The performance problems are with queries based on other fields. – user2869520, May 12 '19 at 10:36

0 Answers