We're building a new data lake that ingests a large volume of data from various sources and stores it as Parquet files in Amazon S3 buckets.
We currently partition the data on a single field (e.g., Record-Creation-Time), so we're fine as long as our queries to the data lake filter on that field.
Now we also need to query the same data by a few other fields (e.g., Last-Updated-Time, Transaction-Time, etc.). How can we do this without duplicating the data and storing a separate copy partitioned by each of those fields?
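To make the problem concrete, here is a rough sketch of the situation, assuming Hive-style `key=value` prefixes with one partition per creation day. The bucket name, the prefix layout, the snake_case field names, and the helper functions are all placeholders for illustration, not our real setup.

```python
from datetime import date, datetime

# Hypothetical bucket and prefix, for illustration only.
BASE = "s3://example-data-lake/events"


def partition_prefix(record_creation_time: datetime) -> str:
    """Return the assumed Hive-style partition prefix for a record's creation day."""
    day = record_creation_time.date().isoformat()
    return f"{BASE}/record_creation_date={day}/"


def prefixes_to_scan(all_prefixes, creation_date=None):
    """Mimic partition pruning: only a filter on the partition field narrows the scan."""
    if creation_date is None:
        # No filter on the partition field, so every partition must be read.
        return sorted(all_prefixes)
    wanted = f"record_creation_date={creation_date.isoformat()}/"
    return sorted(p for p in all_prefixes if p.endswith(wanted))


# Made-up sample records: created on three different days, all updated on the same day.
records = [
    {"record_creation_time": datetime(2019, 1, 1, 9), "last_updated_time": datetime(2019, 1, 3, 12)},
    {"record_creation_time": datetime(2019, 1, 2, 10), "last_updated_time": datetime(2019, 1, 3, 8)},
    {"record_creation_time": datetime(2019, 1, 3, 11), "last_updated_time": datetime(2019, 1, 3, 11)},
]
all_prefixes = {partition_prefix(r["record_creation_time"]) for r in records}

# A filter on the partition field touches a single prefix.
by_creation = prefixes_to_scan(all_prefixes, creation_date=date(2019, 1, 2))

# A filter on last_updated_time cannot be mapped to a prefix, so all prefixes are scanned.
by_last_updated = prefixes_to_scan(all_prefixes)
```

With this layout, a query on Record-Creation-Time reads one prefix, while a query on Last-Updated-Time has to read every prefix even though all matching records share the same update day.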
I'm sure this is a common problem with established solutions, but I've searched many big-data blogs and sites and haven't found anything specific to this case. I'm hoping the data experts on Stack Overflow can suggest the right way to store the data in the data lake so that we can query it by several different fields.