
I am looking to create an inference pipeline using AWS SageMaker, similar to the one shown here. The key difference is that our data is stored across multiple normalised tables and is too large to fit comfortably in memory.

For example, the ETL'd data sits in an S3 bucket in multiple tables:

  • customer: Contains customer details such as age and sex. customer_id is the primary key.
  • customer_sales: Contains customer sales. customer_id is a foreign key, and customers have multiple sales, sometimes hundreds per customer.
  • customer_other: Contains some other information about customers. customer_id is a foreign key.

Each table is partitioned into multiple files, and a given customer can appear in any of the partitions.
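
For concreteness, here is roughly the layout as I picture it (the bucket name, prefixes and Parquet format are my assumptions). Any single partition reads comfortably; it is the joined whole that doesn't fit:

```python
import pandas as pd

# Assumed layout (hypothetical bucket/prefixes, Parquet format assumed):
#   s3://my-etl-bucket/customer/part-0000.parquet,       part-0001.parquet, ...
#   s3://my-etl-bucket/customer_sales/part-0000.parquet, part-0001.parquet, ...
#   s3://my-etl-bucket/customer_other/part-0000.parquet, part-0001.parquet, ...
#
# A given customer_id can show up in several partitions of
# customer_sales / customer_other.

# Reading one partition on its own is fine (needs pyarrow + s3fs installed):
sales_part = pd.read_parquet("s3://my-etl-bucket/customer_sales/part-0000.parquet")
print(sales_part.groupby("customer_id").size().head())
```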

If the tables all fit in memory, I know how to approach this task, as shown in the linked example. However, what infrastructure would be typical for building an inference pipeline over all customers in the out-of-memory case? The data isn't huge - tens of GBs rather than hundreds - but it is still large enough that we would rather not run the pipeline with the entire dataset in memory.
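
To make "I know how to approach this" concrete, the in-memory version looks roughly like the sketch below (the column names, aggregations and model are placeholders, and reading Parquet straight from S3 with pandas is an assumption):

```python
import pandas as pd

# Works only while every table fits in RAM on a single instance.
customer = pd.read_parquet("s3://my-etl-bucket/customer/")        # all partitions
sales    = pd.read_parquet("s3://my-etl-bucket/customer_sales/")
other    = pd.read_parquet("s3://my-etl-bucket/customer_other/")

# Collapse the one-to-many sales table to one row per customer
# ("amount" and the aggregations are hypothetical).
sales_features = (
    sales.groupby("customer_id")
         .agg(total_spend=("amount", "sum"), n_sales=("amount", "size"))
         .reset_index()
)

# One feature row per customer.
features = (
    customer.merge(sales_features, on="customer_id", how="left")
            .merge(other, on="customer_id", how="left")
)

# predictions = model.predict(features.drop(columns=["customer_id"]))
```

What I'm after is the SageMaker-native way to do the equivalent of the groupby/merge step above without loading everything at once.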

  • Are you running into OOM errors? I believe with larger instances, you should be able to preprocess/train your model on the entire dataset. – durga_sury May 09 '23 at 22:41
  • Thanks for your comment - yeah, we can process it with a large instance, but there is an internal desire to build it out-of-memory so we can scale it... – FChm May 10 '23 at 10:00

0 Answers