I want to be able to archive my data from Mongo into S3. Currently, what I do is:
- Read data from Mongo
- Convert this into a pyarrow Table
- Write to S3
It works for now, but steps 1 and 2 are basically one bulk operation, so if the result set is huge it all has to fit into memory at once. I could read in smaller chunks, but I think that would also mean writing smaller parquet files, which isn't great for read performance. Is there a best practice for this? (I've put a rough sketch of the chunked version I have in mind below the current code.)
import json

import pyarrow as pa
import pyarrow.parquet as pq
from motor.motor_asyncio import AsyncIOMotorCollection
from pyarrow import fs

audits_collection: AsyncIOMotorCollection = db.get_collection("audits")

# Pull the whole window into memory in one go.
results = await audits_collection.find({
    "eventOccurredAt": {
        "$gte": from_date,
        "$lt": to_date,
    },
}).to_list(None)  # OK for now but thinking this is not scalable if it gets too large

# Round-trip through JSON so things like ObjectIds and datetimes become strings.
transformed = json.loads(json.dumps(results, default=str))
table = pa.Table.from_pylist(transformed, schema=schema)

s3 = fs.S3FileSystem()
pq.write_to_dataset(
    table,
    root_path="bucket-name/archive",
    partition_cols=["date", "hour"],
    existing_data_behavior="delete_matching",
    filesystem=s3,
)
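For reference, this is roughly what I had in mind with the chunked variant (just an untested sketch): stream the cursor and append each buffered batch as a row group to one parquet file per date/hour partition via pq.ParquetWriter, so memory stays bounded without producing lots of tiny files. Names like archive_hour, _flush and BATCH_SIZE are made up, and I'm assuming one call handles a single date/hour window so the partition path is known up front.

import json

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

BATCH_SIZE = 50_000  # docs per flush; made-up number, tune to the memory budget


def _flush(docs, schema, path, s3, writer):
    # Same ObjectId/datetime -> str coercion as the bulk version above.
    cleaned = json.loads(json.dumps(docs, default=str))
    table = pa.Table.from_pylist(cleaned, schema=schema)
    if writer is None:
        # Lazily open a single file for the whole window.
        writer = pq.ParquetWriter(path, schema, filesystem=s3)
    writer.write_table(table)  # each flush becomes a row group, not a separate file
    return writer


async def archive_hour(audits_collection, schema, from_date, to_date, date_str, hour_str):
    """Stream one (date, hour) window from Mongo into a single parquet file on S3."""
    s3 = fs.S3FileSystem()
    # Hand-built partition path; unlike write_to_dataset, the date/hour columns stay
    # inside the file here. Re-running the same window overwrites the file, which is
    # roughly what delete_matching gave me before.
    path = f"bucket-name/archive/date={date_str}/hour={hour_str}/part-0.parquet"

    cursor = audits_collection.find(
        {"eventOccurredAt": {"$gte": from_date, "$lt": to_date}}
    ).batch_size(BATCH_SIZE)

    writer = None
    buffer = []
    try:
        async for doc in cursor:  # stream instead of to_list(None)
            buffer.append(doc)
            if len(buffer) >= BATCH_SIZE:
                writer = _flush(buffer, schema, path, s3, writer)
                buffer = []
        if buffer:
            writer = _flush(buffer, schema, path, s3, writer)
    finally:
        if writer is not None:
            writer.close()

Does something like this look reasonable, or is there a more standard pattern for Mongo-to-parquet archival?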