I am looking to create an inference pipeline using AWS SageMaker, similar to the one shown here. However, the key difference is that our data is stored across multiple normalised tables and is too large to fit comfortably in memory.
For example, the ETL'd data sits in an S3 bucket in multiple tables:
- customer: contains customer details such as age and sex. `customer_id` is the primary key.
- customer_sales: contains customer sales. `customer_id` is a foreign key, and each customer has multiple sales, sometimes hundreds.
- customer_other: some other information about customers. `customer_id` is a foreign key.
Each table is partitioned into multiple files, and a given customer can appear across all of the partitions.
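For concreteness, the objects look something like this (bucket and file names are made up):

```
s3://my-bucket/etl/customer/part-00000.parquet
s3://my-bucket/etl/customer/part-00001.parquet
s3://my-bucket/etl/customer_sales/part-00000.parquet
s3://my-bucket/etl/customer_sales/part-00001.parquet
s3://my-bucket/etl/customer_other/part-00000.parquet
...
```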
If the tables all fitted in memory, I know how to approach this task as shown in the example link. However, what infrastructure would be typical for building an inference pipeline over all customers in the out-of-memory case? The data isn't huge - tens of GBs, not hundreds - but it is still large enough that we would rather not run the pipeline with every data point held in memory.
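For reference, this is roughly what the in-memory version would look like; the bucket paths, column names (e.g. `amount`), the aggregation, and the `model` object are all placeholders:

```python
import awswrangler as wr

# Read each normalised table straight into pandas (fine when everything fits in RAM).
# Bucket/prefix names are illustrative.
customers = wr.s3.read_parquet("s3://my-bucket/etl/customer/", dataset=True)
sales = wr.s3.read_parquet("s3://my-bucket/etl/customer_sales/", dataset=True)
other = wr.s3.read_parquet("s3://my-bucket/etl/customer_other/", dataset=True)

# Collapse the one-to-many sales rows to one row per customer,
# then join everything back onto the customer table.
sales_agg = sales.groupby("customer_id", as_index=False).agg(
    total_spend=("amount", "sum"),   # 'amount' stands in for our real sales columns
    n_sales=("amount", "count"),
)
features = (
    customers
    .merge(sales_agg, on="customer_id", how="left")
    .merge(other, on="customer_id", how="left")
)

# 'model' is whatever estimator/endpoint we end up using; loaded elsewhere.
predictions = model.predict(features.drop(columns=["customer_id"]))
```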