I am using Python Mongoengine for inserting image files into GridFS, with the following method:

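product = Product(name='New Product', price=20.0, ...)
# Open in binary mode so GridFS receives the raw image bytes.
with open(<IMAGE_FILE>, 'rb') as product_photo:
    # Pass the same file handle that was opened above.
    product.image.put(product_photo, content_type='image/jpeg')
product.save()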

When I view this data with NoSQLBooster (or anything else) the data is represented like so:

{
    "_id" : ObjectId("5d71263eae9a187374359927"),
    "files_id" : ObjectId("5d71263eae9a187374359926"),
    "n" : 0,
    "data" : BinData(0,"/9j/4AAQSkZJRgABAQEASABIAAD/4V6T...  more 261096 bytes - image/jpeg")
},

Knowing that the second part of the tuple in BinData of the "data" field contains base64 encoding, I'm confused: at which point do the raw bytes given by open(<IMAGE_FILE>, 'rb') become encoded with base64?

Furthermore, base64 encoding is 33% - 37% larger in size, which is bad for transferring that data. How can I choose the encoding, or at least stop it from using base64?

I have found this SO question which mentions a HexData data type.

I also found others mentioning subtypes as well, which led me to find this about BSON data types.

Binary

Canonical:
{ "$binary":
   {
      "base64": "<payload>",
      "subtype": "<t>"
   }
}

Relaxed: <Same as Canonical>

Where the values are as follows:
"<payload>": Base64 encoded (with padding as "=") payload string.
"<t>": A one- or two-character hex string that corresponds to a BSON binary subtype. See the extended bson documentation http://bsonspec.org/spec.html for subtypes available.

Which clearly tells us the payload will be base64!

So can I change this, or does it have to be that way?

Jamie Lindsey

1 Answer

at which point the raw bytes ... becomes encoded with base64

Direct Answer

Only at the point where you choose to display them on your console or through some other "display" format. The native BSON format that crosses the wire carries the raw bytes and is not base64 encoded.

If you choose not to display the contents to your terminal or debugger, it will never have been encoded to base64 or any other format.

Point of Correction

which led me to find this about BSON data types.

Which clearly tells us the payload will be base64!

The linked page is referring to MongoDB Extended JSON, not the wire BSON format.

It is true that Extended JSON encodes binary data as base64, but that is not true of BSON itself.

As described below, the only time your driver will pass the data through the Extended JSON conversion is at the moment you ask it to display the contents to you via a print or debug statement.

Details

In the BSON spec (BSON being the internal MongoDB serialization format), binaries are stored as native bytes.

The relevant portion of the spec:

binary  ::=     int32 subtype (byte*)

indicates that a binary value consists of

  1. an int32 holding the length of the byte* payload,
  2. followed by a 1-byte subtype,
  3. followed by the raw bytes.

For the bytes "Hello\x00World", which include a null byte right in the middle, the "wire format" would be

[11] [0x00] [Hello\x00World]

Notice that Stack Overflow, like virtually every driver or display terminal, struggles with the embedded null byte unless the system makes it evident that the null byte is actually part of the bytes being displayed.

That is, the length (an integer packed into 32 bits, little-endian), followed by the 1-byte subtype, followed by the literal bytes, is what actually crosses the wire.
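To make the layout above concrete, here is a minimal sketch that builds the binary value by hand using only the standard library. The helper name encode_bson_binary is made up for illustration; this is not the driver's code, just the three fields from the spec line concatenated in order.

```python
import struct


def encode_bson_binary(payload: bytes, subtype: int = 0x00) -> bytes:
    """Hypothetical helper: lay out `int32 subtype (byte*)` by hand."""
    # BSON stores the payload length as a little-endian signed 32-bit integer.
    length = struct.pack("<i", len(payload))
    # One subtype byte, then the payload exactly as given (no re-encoding).
    return length + bytes([subtype]) + payload


payload = b"Hello\x00World"
wire = encode_bson_binary(payload)

print(wire)       # the payload appears verbatim, null byte included
print(len(wire))  # 4 length bytes + 1 subtype byte + 11 payload bytes
```

Nothing in this layout transforms the payload; the only additions are the five bytes of length and subtype in front of it.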

As you pointed out, most languages would have immense trouble rendering this onscreen to a user.

Extended JSON is the specification that defines how drivers should render such non-displayable data as text.

Object IDs aren't just bytes; they're objects that can represent timestamps.

Timestamps aren't just numbers; they can represent timezones and be converted for display in the user's timezone.

Binaries aren't always text and may contain problematic bytes, so the easiest way to avoid borking your terminal/GUI/debugger is to encode them in some ASCII format like base64.
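As a rough illustration of that display step, the sketch below renders raw bytes in the shape of the Extended JSON $binary document quoted in the question. The function to_extended_json_binary is a hypothetical stand-in written with the standard library only; it is not what any driver actually calls, but it shows that base64 is a text rendering that decodes back to the same bytes.

```python
import base64
import json


def to_extended_json_binary(payload: bytes, subtype: int = 0x00) -> str:
    """Hypothetical helper: render bytes as an Extended JSON style string."""
    doc = {
        "$binary": {
            # base64 is applied only here, at display time
            "base64": base64.b64encode(payload).decode("ascii"),
            # subtype as a two-character hex string
            "subtype": format(subtype, "02x"),
        }
    }
    return json.dumps(doc)


raw = b"Hello\x00World"
shown = to_extended_json_binary(raw)
print(shown)  # safe to print: plain ASCII, no embedded null byte

# Decoding the displayed form recovers the original bytes unchanged.
parsed = json.loads(shown)
recovered = base64.b64decode(parsed["$binary"]["base64"])
print(recovered == raw)
```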

Keep in Mind

bson.Binary and GridFS are not really supposed to be displayed/printed/written in their wire format. The wire format exists for the transfer layer.

To ease debugging and print statements, most drivers implement an easily "displayable" format that passes the native BSON data through the Extended JSON spec.

If you simply choose not to display/encode as Extended JSON/debug/print, the binary bytes will never actually be base64 encoded by the driver.
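On the size concern from the question, a quick back-of-the-envelope comparison may help. This is a sketch under assumptions: the payload is random bytes standing in for image data, and its size is simply borrowed from the byte count shown in the question's output. It compares the size of the binary value on the wire with the size of the base64 text a display tool would show.

```python
import base64
import os
import struct

# Stand-in for image bytes; the size is illustrative only.
payload = os.urandom(261096)

# BSON binary value: 4-byte length + 1-byte subtype + raw payload.
wire_size = len(struct.pack("<i", len(payload))) + 1 + len(payload)

# Base64 text, as it would appear in an Extended JSON rendering.
display_size = len(base64.b64encode(payload))

print(wire_size - len(payload))     # fixed overhead of the wire form, in bytes
print(display_size / len(payload))  # relative size of the base64 text
```

The wire form adds a small fixed number of bytes regardless of payload size, while the roughly one-third growth belongs to the displayed text only.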

bauman.space