I've modeled data for a cricket (yes, the sport) application where there are two collections in question: sessions and deliveries. A session is a collection of deliveries with some additional metadata like location, name, etc. A session's metadata is rarely changed after creation, but a delivery will always be updated a few times after creation (to store ballTrack json, user who is the bowler, and user who is the batsman). I have currently modeled the data as sessions and deliveries being two separate collections where session is referenced by id inside of each delivery object:
session: {
_id: ...,
name: ...,
thumbnailUrl: ...,
deliveryCount: ...,
team: teamId
}
delivery: {
_id: ...,
session: sessionId,
bowler: userId,
batsman: userId,
thumbnailUrl: ...,
ballTrack: {some json},
calibration: {some json},
team: teamId, ...
}
The most important query for this application is one that aggregates all of the sessions a user is a part of on their home page. The way to identify if a user is in a session is to check if they are a bowler or batsman in a delivery in that session or if they are a team member of the team that created that session. Currently I am doing an aggregation on the deliveries collection something like this:
const matchCriteria: Array<Record<string, unknown>> = [
{ bowler: userId },
{ batsman: userId },
];
// match on all user's teams if all sessions specified
if (all) {
const userTeams: Array<TeamDocument['id']> = user.teams;
matchCriteria.push({ team: { $in: userTeams } });
}
await DeliveryModel.aggregate().match({
$or: matchCriteria,
deleted: false,
})
.group({
_id: '$session',
session: { $first: '$$ROOT.session' },
})
.lookup({
from: 'sessions',
localField: 'session',
foreignField: '_id',
as: 'session',
}) ... (some additional projections and sorting)
This is currently working fine, but as this is the most important query I would like to consider optimizing out the $lookup because I don't want to incur performance costs as the collection grows. Granted the lookup is on an indexed field, but am wondering if keeping all info in one collection might be the way to go. For example, I am considering two alternate schemas. One where deliveries are a subdocument array on the session:
session: {
_id: ...,
name: ...,
thumbnailUrl: ...,
deliveryCount: ...,
team: teamId
deliveries: [{_id: ...,
session: sessionId,
bowler: userId,
batsman: userId,
thumbnailUrl: ...,
ballTrack: {some json},
calibration: {some json},
team: teamId, ...
}
This would turn the old aggregation query into a simple find query, but I'm worried of how big the documents will become with this nesting approach. There is a cap for the number of deliveries per session (500 deliveries/session and delivery document is around 1kb in size) so I won't hit the mongoDB bson cap, but still seems like unnecessary overhead to store all of this delivery data with a session when I only need the session info for this specific query (the delivery data isn't displayed until you drill into a specific session).
The other approach would be to nest a bit of session data in each delivery. This would look something like:
delivery: {
_id: ...,
session: {_id: sessionId, name: ..., thumbnailUrl: ..., deliveryCount: ...}
bowler: userId,
batsman: userId,
thumbnailUrl: ...,
ballTrack: {some json},
calibration: {some json},
team: teamId, ...
}
This would keep all necessary info in each document, but would be duplicating session data in each document (also not sure how I'd handle updating the deliveryCount field for session with this approach).
I know this is a lot, but would appreciate some advice on the preferable schema design or if my current will scale well.