Working with MongoDB, I'm fairly aware of when normalization makes sense conceptually, but I'm trying to understand whether it actually pays off in terms of data size or query performance.
As an example, let's say I have a list of 2,000 retail stores. For this example I don't care about anything else, so each store is just a name: at least a few characters, but never longer than a hundred.
I also have 20 million records of trips people took to these stores. Each record holds some information about the trip, along with the store name.
In MySQL, the on-disk size of a table can impact query performance quite heavily. Normalizing the store names, so that each string is stored once and the records hold an integer key instead, saves a lot of disk space, and that translates into performance gains.
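As a rough back-of-envelope check on that disk-space claim (the 30-byte average name and 4-byte key below are assumptions, not measurements, and per-row overhead and indexes are ignored):

```python
# Raw payload of repeated strings vs. integer keys across all trip records.
records = 20_000_000
avg_name_bytes = 30   # assumed average store-name length
int_key_bytes = 4     # assumed integer key width

string_bytes = records * avg_name_bytes   # ~600 MB of repeated store names
key_bytes = records * int_key_bytes       # ~80 MB of integer keys
print(f"strings: {string_bytes / 1e6:.0f} MB, keys: {key_bytes / 1e6:.0f} MB")
```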
In MongoDB, data this small and unimportant would normally just be embedded. However, I'm concerned that at 20 million records, query speed and/or file size will suffer because of how much redundant data is stored.
If 5 million people all went to a single store, there will be 5 million embedded copies of that string across the Mongo documents. That takes up far more disk space than an integer key would, and it seems like it could drag down performance.
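For concreteness, here's a minimal sketch of the embedded shape using pymongo; the database, collection, and field names are hypothetical, and it assumes a mongod running on localhost:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["retail"]

# Embedded shape: every trip document carries the full store name.
db.trips.insert_one({
    "person_id": 12345,
    "visited_at": datetime.now(timezone.utc),
    "store_name": "Acme Hardware - Downtown",  # repeated verbatim in every trip
})
```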
I suppose that if I did normalize the data, Mongo would still incur some overhead on every read to look up that additional data.
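Here's a sketch of that normalized alternative, again with hypothetical names: the string lives once in a stores collection, each trip carries only an integer key, and reading names back needs a join-like step such as an aggregation $lookup (or a second query plus an in-memory map):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["retail"]

# Normalized shape: the name is stored once; trips reference it by key.
store_id = 42
db.stores.insert_one({"_id": store_id, "name": "Acme Hardware - Downtown"})
db.trips.insert_one({"person_id": 12345, "store_id": store_id})

# Reading trips together with store names now needs a join-like step:
pipeline = [
    {"$match": {"person_id": 12345}},
    {"$lookup": {
        "from": "stores",
        "localField": "store_id",
        "foreignField": "_id",
        "as": "store",
    }},
]
for trip in db.trips.aggregate(pipeline):
    print(trip["store"][0]["name"])
```

That $lookup (or the extra round trip) is exactly the overhead I'm asking about.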