2

I am trying to understand Data Lakes, and most examples show only simple use cases. What I want to understand is effectively 'join queries'.

For example, I have files with product data (uploaded to S3-Product-Data) and a database with product annual sales (uploaded to S3-Product-Sales). How does AWS Lakes / Athena make a query that runs across these two environments?

Of course something will need to link them. I am just not sure what the query would look like, nor what Athena does under the covers to merge the data (and be performant).

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
Steve M
  • 89
  • 1
  • 2
  • 9
  • If you have defined two tables in Amazon Athena, you can `JOIN` them with regular SQL syntax. What specific difficulty are you having? – John Rotenstein Jul 12 '19 at 03:52

2 Answers2

2

What you do is create a table in Athena that references the files with product data, and another table that references the files with annual sales. After that you can run SQL that combines the tables.

Exactly how the SQL would look depends on your data, what columns it has, etc. If your product data has a product_id column, and your sales data does too, you can join them like this (the column names are of course all made up):

SELECT product_name, SUM(sales.sold_for) AS total_revenue
FROM products
LEFT JOIN sales USING (product_id)
Theo
  • 131,503
  • 21
  • 160
  • 205
  • Seems pretty straightforward. But in scenario above, how do Athena actually manage the activity - some data is in files, other data is in a database. So when join initially takes place does it need to unify all data in its own database? And does it then do this every time the query is made? – Steve M Jul 13 '19 at 07:20
  • You said your database was uploaded to S3, so it's just files on S3, right? – Theo Jul 14 '19 at 08:10
0

For the data you have in database please extract the data in csv file and put it into another s3 bucket. Then configure the glue crawler to crawl on this s3 bucket and let it create a table. You can then join you already existing athena table and the new table.