12

Say I have a Products array in my MongoDB. I'd like users to be able to see each product on its own page: http://www.mysite.com/product/12345/Widget-Wodget. Since each Product doesn't have an incremental integer ID (12345) but instead has a BSON ID (5063a36bdeb13f7505000630), I'd need to either add an integer ID or use the BSON ID.

Since BSON IDs include the PID, among other components:

  • 4-byte timestamp,
  • 3-byte machine identifier,
  • 2-byte process id,
  • 3-byte counter.

Am I exposing sensitive information to the outside world if I use the BSON ID in my URL?
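
For illustration, here is a minimal sketch in Python (using the bson package that ships with pymongo) that unpacks these four fields from the example id above. The machine/pid split reflects the older ObjectID layout described here, so treat those two fields as an assumption for ids generated by newer drivers.

    # Sketch: unpack the components of a 2012-era ObjectId.
    import struct
    from bson import ObjectId

    oid = ObjectId("5063a36bdeb13f7505000630")
    raw = oid.binary                                # the raw 12 bytes

    timestamp = struct.unpack(">I", raw[0:4])[0]    # 4-byte creation time (seconds since epoch)
    machine   = raw[4:7].hex()                      # 3-byte machine identifier (old layout)
    pid       = struct.unpack(">H", raw[7:9])[0]    # 2-byte process id (old layout)
    counter   = int.from_bytes(raw[9:12], "big")    # 3-byte counter

    print(oid.generation_time)                      # same timestamp, as a datetime
    print(machine, pid, counter)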

jcollum
  • 43,623
  • 55
  • 191
  • 321
  • The way I do it: I usually encode the BSON id to base62, just for URL shortening (a rough sketch follows this comment thread). But as far as I know no significant issue can arise by doing so. – Sushant Gupta Dec 05 '12 at 19:02
  • @SushantGupta relevant, in that case: http://stackoverflow.com/questions/6338870/how-to-implement-a-short-url-like-urls-in-twitter – jcollum Dec 05 '12 at 19:07
  • Yeah, that works fine. But the way I do it, I don't have to maintain a database or collection just for my URL shortening. It's a quick method. So far I haven't come across any security issue either :) – Sushant Gupta Dec 05 '12 at 19:15
  • I think the machine ID and process ID are hashed anyways. As for the timestamp and counter, those aren't very sensitive. But your ObjectID's are predictable, so I wouldn't use them to hide sensitive pages. – theabraham Dec 05 '12 at 19:22
  • Oh yes, I use BSON ID for pages I wish to keep public. Obviously not for private sensitive pages :) – Sushant Gupta Dec 05 '12 at 19:25
  • I don't see any real security threats here. I am not sure how useful the information within the BSON id is for a person who hasn't physically hacked your computer, in which case you're screwed anyway. They can't even use the machine id since it isn't publicly broadcast through the network interface, so all the information in the BSON id is useless really; fair enough, it is revealing in some small glimmer, but not very much. – Sammaye Dec 05 '12 at 20:58
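
As an aside on the base62 idea from the first comment, a rough sketch of that shortening might look like the following (Python; encode_base62 and decode_base62 are hypothetical helpers, not a pymongo API):

    # Sketch: base62-encode an ObjectId purely to shorten the URL token.
    from bson import ObjectId

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def encode_base62(oid: ObjectId) -> str:
        n = int(str(oid), 16)               # 24-char hex id as a big integer
        out = ""
        while n:
            n, rem = divmod(n, 62)
            out = ALPHABET[rem] + out
        return out or "0"

    def decode_base62(token: str) -> ObjectId:
        n = 0
        for ch in token:
            n = n * 62 + ALPHABET.index(ch)
        return ObjectId(format(n, "024x"))  # back to the 24-char hex form

    short = encode_base62(ObjectId("5063a36bdeb13f7505000630"))
    print(short, decode_base62(short))

Note that this only shortens the id; anyone can decode it, so the embedded timestamp is still visible, which matches the point made in the comments above.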

3 Answers

18

I can't think of any way to use this to gain privileges on your machines; however, using ObjectIds everywhere nonetheless discloses a lot of information.

By crawling your website, one could:

  • find out about some hidden objects: for instance, if the counter part goes from 0x....b1 to 0x....b9 between times t1 and t2, one can guess ObjectIds within these intervals. However, guessing ids is most likely useless if you enforce access permissions
  • know the signup date of each user (not very sensitive info but better than nothing)
  • deduce actual (as opposed to publicly available) business hours from the timestamps of objects created by the staff
  • deduce in which timezones your audience lives from the timestamps of user-generated objects: if your website is one which people use mostly at lunchtime, then one could measure peaks of ObjectIds and deduce that a peak at 8 PM UTC means the audience was on the US West coast
  • and more generally, by crawling most of your website, one can build a timeline of the success of your service, having for any given time knowledge of: your user count, levels of user engagement, how many servers you've got, how often your servers are restarted. PID changes occurring on weekends are more likely crashes, whereas those on business days are more likely crashes + software revisions
  • and probably find other info specific to your business processes and domain

To be fair, even with random ids one can infer a lot. The main issue is that you need to prevent anyone from scraping a statistically significant part of your site. But if someone is determined, they'll succeed eventually, which is why providing them with all of this extra, timestamped info seems wrong.
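
To make the crawling argument concrete, here is a small sketch that recovers creation times and machine/pid pairs from ObjectId strings; the crawled_ids list is hypothetical, and bucketing timestamps by hour is all the timezone and business-hours inferences above require.

    # Sketch: turn a pile of crawled ObjectId strings into a simple timeline.
    from collections import Counter
    from bson import ObjectId

    crawled_ids = [
        "5063a36bdeb13f7505000630",            # ...thousands more in a real crawl
    ]

    hours = Counter()
    machines = set()
    for s in crawled_ids:
        oid = ObjectId(s)
        hours[oid.generation_time.hour] += 1   # UTC hour the object was created
        machines.add(oid.binary[4:9].hex())    # machine id + pid (old-format ids)

    print("activity by UTC hour:", dict(hours))
    print("distinct machine/pid pairs seen:", len(machines))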

guillaume
  • 1,380
  • 10
  • 15
  • 2
    I don't understand the point about hidden objects. Also, you can deduce business hours by looking at what times you can contact the company; most companies display business hours on their site freely anyway. IDs are made server side, so they are server time, not client time. How can they build a timeline from your BSON id? There is no way to predict, overall, what the id tells about the database or the servers. How can they tell how many times you restarted without querying every single id, even the deleted ones? – Sammaye Dec 06 '12 at 09:06
  • I get the first point slightly, but that's one that exists with an incrementing id as well and should be solved by RBAM. I still don't get the others, especially on an e-commerce site, unless everyone lies about when they are available. Also, the signup date of each user on most sites is useless, since sites like Google, Facebook, YouTube and so many others show when the user signed up in true date form, so those stats are easier to scrape there. User-generated objects will still have a server-side objectid, so I still don't see that. – Sammaye Dec 06 '12 at 10:03
  • I still don't understand the 5th point either, since they would have to query every single objectid, including ones that no longer exist on the system, in order to understand that; otherwise they are getting inaccurate results from their crawls. You can't even detect whether an objectid, on its own, is a user object or a product object, so you can't just crawl all objectids; it is easier to understand if they were to just use SEO-friendly URLs than plain objectids. – Sammaye Dec 06 '12 at 10:11
  • Sorry to spam like this, I am just thinking of more things. Another thing is that, due to the formation of the objectid, it isn't sequential, which means that unlike an auto-incrementing id you cannot reliably predict the next _id in the chain, which means crawling by object id for a particular object (i.e. user or page_click) is quite difficult, if doable at all. – Sammaye Dec 06 '12 at 10:17
  • 1
    (1) I wouldn't compare ObjectIds vs. incrementing ids but rather ObjectIds vs. random ids, which disclose less info (2) Not all sites publish signup dates. If yours doesn't, it would be a mistake to use ObjectIds as user ids (3) About business hours, I tried to think of clever uses but I couldn't. Maybe you're right (5) Statistics allow one to draw conclusions from a limited sample; I don't understand your objection, especially about deleted objects. If crawling your site gives a large number of ids from URLs or the HTML, then you can graph user growth, server counts and restarts accurately enough – guillaume Dec 06 '12 at 11:01
  • Well, a random id is still time-based, so you still have the same problem. You could encode the string, but that just gives an extra layer really, no real defense against scraping. Yeah, true, if you don't want to display user signup dates it could have an effect there. About 5, I guess it just comes from the fact that I can't see how someone would go about crawling the site in a manner that would give meaningful results. – Sammaye Dec 06 '12 at 11:06
  • I can understand server restarts since you can do that with normal spidering, but I can't grasp how the other statistics could be gathered, it seems really out of reach of non-contextual crawl bots. – Sammaye Dec 06 '12 at 11:15
  • My definition of a random id is that it is made by a cryptographically-secure [PRNG](https://en.wikipedia.org/wiki/PRNG) and then checked for uniqueness; it is not time-based as you say (a sketch follows this comment thread). I still don't understand your objection about (5). [Confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval) show that you can determine your margins of error. Once you've crawled enough pages, you reach accurate results. Then you can guess some revenues from user count and engagement, and some expenses from server counts – guillaume Dec 06 '12 at 11:21
  • "how the stats could be gathered?" Crawl a large part of the site. You obtain thousands of URLs such as user profiles (eg /user/abcd1234), user-generated content (eg /posts/dcba4321), form data (reply to comment ef567890), etc. Since these ids are all timestamped and typed by the URL/form (such id is a user, such is a post, etc.), you can graph each type through time. I'll admit that posts and comments are always timestamped anyway but not users and some other classes of objects. – guillaume Dec 06 '12 at 11:34
  • Hmm, the average unique id is time-based for a reason, for relative uniqueness without having to query the database, but ok. Anyway, thinking about it more, you can predict how much traffic a site might have got, but I definitely don't see how you can get a server count from the _id, considering that the _id is made in the app, so the server count won't be reliable, i.e. you could have one app server and 7 db servers. But yeah, you could graph the number of top-level entities someone might have. – Sammaye Dec 06 '12 at 11:36
  • "anyway thinking about it more you can predict how much traffic a site might have got" In comparison to your unique id. – Sammaye Dec 06 '12 at 12:00
  • To get around these issues you'd need to come up with a highly random way to make object ids that is also unique across multiple databases. On the other hand, you're exposing _some_ information to a determined attacker (?). The perils of the first part seem to outweigh the perils of the second part. – jcollum Dec 06 '12 at 16:41
  • @jcollum: "highly random way to make object ids that is also unique across multiple databases": random is the job of /dev/urandom and unique is easy if this id is a sharding key. I agree is it impossible not to expose any information, but I understood your question as a puzzle – guillaume Dec 06 '12 at 21:01
  • ObjectIDs are normally generated by the client driver rather than the server ... so your inferences on server restarts and usefulness of PIDs are incorrect. The use case described is also product URLs rather than user IDs, so there is generally less interesting behaviour to infer. If someone can access a "hidden" resource because they can guess or find an ID, the fundamental problem is that the resource is not properly secured ;-). – Stennie Dec 07 '12 at 00:29
  • @Stennie: The client driver runs on a web/API/whatever-you-call-it server, which is useful to be able to count. Plus, if you insert an object without an id, I'm not sure which one of the client driver or mongod generates it, but it may be mongod, I haven't checked. I agree product URLs are not that useful but once you publish these, it is tempting to use ObjectIds everywhere. Agreed on blocking access to resources – guillaume Dec 07 '12 at 08:17
  • The client API generates the objectid, not mongod; it is designed specifically to create a high level of randomness without the overhead of checking its uniqueness against the database, which would slow inserts to a crawl on a system. – Sammaye Dec 07 '12 at 08:28
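
For reference, the random-id alternative guillaume describes in the comments might look roughly like this (Python/pymongo; the database and collection names are made up, and uniqueness is enforced by the unique index MongoDB already keeps on _id):

    # Sketch: use a cryptographically random, non-time-based token as _id.
    import secrets
    from pymongo import MongoClient
    from pymongo.errors import DuplicateKeyError

    products = MongoClient()["shop"]["products"]           # hypothetical collection

    def insert_with_random_id(doc: dict, id_bytes: int = 9) -> str:
        while True:
            doc["_id"] = secrets.token_urlsafe(id_bytes)   # e.g. 'K7f3qz1XbQ2w'
            try:
                products.insert_one(doc)
                return doc["_id"]
            except DuplicateKeyError:                      # vanishingly unlikely; retry
                continue

    print(insert_with_random_id({"name": "Widget Wodget"}))
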
2

Sharing the information in the ObjectID will not compromise your security. Someone could infer minor details such as when the ObjectID was created (timestamp), but none of the ObjectID components should be tied to authentication or authorization.

If you are building an e-commerce site, SEO is typically a strong consideration for public URLs. In this case you normally want to use a friendlier URL with shorter and more semantic path components than an ObjectID.

Note that you do not have to use the default ObjectID for your _id field, so you could always generate something more relevant for your application. The default ObjectID does provide a reasonable guarantee of uniqueness, so if you implement your own _id allocation you will have to take this into consideration.
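
As a sketch of that suggestion (the database, collection, fields, and the slug-style _id below are purely illustrative):

    # Sketch: use an application-meaningful _id instead of the default ObjectId.
    # Uniqueness is now the application's responsibility.
    from pymongo import MongoClient

    products = MongoClient()["shop"]["products"]    # hypothetical collection

    products.insert_one({
        "_id": "12345-widget-wodget",               # e.g. "<numeric id>-<slug>"
        "title": "Widget Wodget",
        "price": 19.99,
    })

    # The public URL path can then be derived directly from _id:
    doc = products.find_one({"_id": "12345-widget-wodget"})
    print("/product/" + doc["_id"])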

Stennie
  • 63,885
  • 14
  • 149
  • 175
2

As @Stennie said, not really.

Let's start with the pid. Most hackers wouldn't bother looking for a pid; on, say, Linux, they would instead just do:

ps aux | grep mongod

or something similar. Of course, this requires the hacker to have actually hacked your server; I know of no public hack based on the pid alone. Considering the pid will change whenever you restart the machine or mongod, this information is utterly useless to anyone trying to spy.

The machine id is another bit of data that is quite useless publicly and, to be honest, they would get a better understanding of your network using ping or dig than they would through the machine id alone.

So to answer the question: No, there is no real security threat and the information you are displaying is of no use to anyone except MongoDB really.

I also agree with @Stennie on using SEO-friendly URLs; an example I commonly use for e-commerce is /product/product_title_ with a smaller random id (maybe base64-encode the _id) or an auto-incrementing id, with .html on the end (a rough sketch follows).
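
A rough sketch of that URL scheme (Python; slugify is a tiny hypothetical helper, not a library call):

    # Sketch: slug from the product title plus a shorter base64 form of the _id.
    import base64
    import re
    from bson import ObjectId

    def slugify(title: str) -> str:
        return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

    def product_url(title: str, oid: ObjectId) -> str:
        short_id = base64.urlsafe_b64encode(oid.binary).decode().rstrip("=")
        return "/product/" + slugify(title) + "-" + short_id + ".html"

    print(product_url("Widget Wodget", ObjectId("5063a36bdeb13f7505000630")))
    # -> /product/widget-wodget-UGOja96xP3UFAAYw.html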

Sammaye
  • 43,242
  • 7
  • 104
  • 146
  • ObjectIDs are normally generated by the driver (not the MongoDB server) so the information (PIDs, etc) would relate to the application rather than MongoDB. Different processes (eg. Apache worker threads) will vary the PID so you can't make a reliable inference that PIDs relate to server restarts. – Stennie Dec 07 '12 at 00:39