MongoDB - Inconsistent representation of binary data using C++ and Java drivers

Question

I need to keep some binary data in my MongoDB collection. It seems that I'm getting different JSON representations of my documents when retrieving the same record using either the C++ driver or the Java driver. Here is an example. Insert two records into a MongoDB collection using the mongo shell:

db.binary_test.insert({"name":"Alex", "data" :BinData("0x00", "12345678")})
db.binary_test.insert({"name":"Alex", "data" :BinData("0x80", "12345678")})

The first record uses binary type 0x00 (generic); the second - 0x80 (user defined).

Retrieve these records using the mongo shell:

db.binary_test.find().pretty()

Output:

{
    "_id" : ObjectId("51acf66886174308b610d950"),
    "name" : "Alex",
    "data" : BinData(0,"12345678")
}
{
    "_id" : ObjectId("51acf66c86174308b610d951"),
    "name" : "Alex",
    "data" : BinData(128,"12345678")
}

Note that the tag is represented as a number, not as a hex-string.

Now retrieve the same records using a very simple Java program and convert them to JSON using the strict serializer:

ObjectSerializer serializer = JSONSerializers.getStrict();
System.out.println(serializer.serialize(doc));

Here is the output:

{ "_id" : { "$oid" : "51acf66886174308b610d950"} , "name" : "Alex" , "data" : { "$binary" : "12345678" , "$type" : 0}}
{ "_id" : { "$oid" : "51acf66c86174308b610d951"} , "name" : "Alex" , "data" : { "$binary" : "12345678" , "$type" : -128}}

Note that the binary data type is represented as an integer, not a hex-string.
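
The -128 in the second record is the subtype byte 0x80 read as a signed Java byte. As a minimal sketch, the value can be rendered as the two-digit hex string that the C++ driver prints. The class name `SubtypeHex` and the method `toHex` are hypothetical names chosen for illustration and are not part of the MongoDB Java driver.

```java
// Hypothetical helper for illustration; not part of the MongoDB Java driver.
class SubtypeHex {
    // Render a binary subtype (held in a signed Java byte) as two lowercase hex digits.
    static String toHex(byte subtype) {
        // Masking with 0xFF reads the byte as an unsigned value, so -128 becomes 128.
        return String.format("%02x", subtype & 0xFF);
    }
}
```

With this sketch, a subtype of -128 is rendered as "80" and a subtype of 0 as "00", which matches the strings in the C++ output.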

Now, for comparison, use the MongoDB C++ driver to retrieve the same two records and print them using the jsonString() method. Here is the output:

{ "_id" : { "$oid" : "51acf66886174308b610d950" }, "name" : "Alex", "data" : { "$binary" : "12345678", "$type" : "00" } }
{ "_id" : { "$oid" : "51acf66c86174308b610d951" }, "name" : "Alex", "data" : { "$binary" : "12345678", "$type" : "80" } }

Now the type is a hex-string, not a number.

So the same record has different JSON representations depending on whether it was retrieved using the C++ driver or the Java driver. This discrepancy creates problems in mixed environments when some software uses the Java driver and some uses the C++ driver. Any suggestions how to solve the problem (other than by changing the driver code)? And which one is correct - the C++ driver that represents the type as a hex-string, or the Java driver? My understanding is that the representation returned by the C++ driver is correct, but can someone confirm this?

The MongoDB HTTP interface also returns the hex-string representation, probably because the backend that supports the REST interface (mongod) is written in C++.

I'm using Java driver version 2.11.1 and C++ driver version 2.4.3.

have you seen this page? http://docs.mongodb.org/manual/reference/mongodb-extended-json/ — Asya Kamsky, Jun 04 '13 at 00:10
According to these specs, in strict mode the type must be a quoted string, so the representation provided by the Java Driver does not match the specs. — Alex Foygel, Jun 05 '13 at 18:55

score 0 · Answer 1 · answered Jun 04 '13 at 14:47

There is no difference here. The data is the same, only the formatters that make it human readable present it in a different format.

[...] print them using the jsonString() method

This is the point: you're looking at formatted output. "0x80", "80", 128 and -128 can all mean the same thing. Data is always interpreted according to some convention. In the case of "0x80", the "0x" prefix is a widespread convention for indicating hexadecimal notation. "80" requires you to know that the data is to be interpreted as a hex string; it corresponds to a binary value of 10000000, which, read as an unsigned integer, equals decimal 128. If the value is interpreted as a signed byte, however, it corresponds to -128.
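
To make that concrete, here is a small Java sketch that reads each of the four notations and arrives at the same byte. The class `SameByte` and its method names are made up for this illustration; only `Integer.decode` and `Integer.parseInt` are standard library calls.

```java
// Illustrative only: four notations, one underlying byte.
class SameByte {
    // "0x80": Integer.decode understands the "0x" prefix.
    static byte fromPrefixedHex(String text) {
        return (byte) Integer.decode(text).intValue();
    }

    // "80": the caller has to know that the string is base 16.
    static byte fromBareHex(String text) {
        return (byte) Integer.parseInt(text, 16);
    }

    // 128 or -128: narrowing to byte keeps only the low 8 bits.
    static byte fromInt(int number) {
        return (byte) number;
    }
}
```

All four inputs ("0x80", "80", 128 and -128) produce the same Java byte, whose signed value is -128 and whose bit pattern is 10000000.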

Your code should not look at the formatted output, but compare to a field or property that has a well-defined type such as int. Then, you can write

if(a.Type == 128) { ... }

This should evaluate to true for the correct value, no matter what the formatter of your programming language outputs. (If a.Type were a signed byte, you'd have to compare to -128. Many programming languages would issue a warning or an error if you compared it to 128, because 128 is larger than the largest value a signed 8-bit type, i.e. a byte, can represent.)
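
In Java specifically, a byte is promoted to int in a comparison, so comparing a byte directly to 128 compiles but is never true. A hedged sketch of a check that works regardless of the signed reading follows; `SubtypeCheck`, `USER_DEFINED` and `isUserDefined` are hypothetical names, not driver API.

```java
// Illustrative check on a typed value instead of on formatted JSON output.
class SubtypeCheck {
    // The user-defined binary subtype from the question: 0x80, i.e. 128 unsigned.
    static final int USER_DEFINED = 0x80;

    static boolean isUserDefined(byte subtype) {
        // Mask to the low 8 bits so the signed value -128 compares equal to 128.
        return (subtype & 0xFF) == USER_DEFINED;
    }
}
```

The same masking works when the subtype arrives as an int holding -128, since only the low 8 bits are compared.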

By the way, the different formattings are even more striking when you look at the object id's representation: in the mongo console, it is presented as ObjectId("51acf66886174308b610d950"), the C++ json formatter displays "_id" : {"$oid" : "51acf66886174308b610d950"}. Again, it's the same data, but here the string looks the same while the scaffold around it looks different.

The problem is it's not my code that is reading the formatted output, but rather C++ mongo::fromjson() method that is part of the C++ MongoDB driver. There is no "my code", so to speak, between the java driver getting documents from the data, and the C++ program getting these documents as provided by the Java driver — Alex Foygel, Jun 05 '13 at 17:39
It's only JSON -- you still have to interpret the data yourself. String or not string - it doesn't matter. It's like comparing the visual display of two hex viewers that show the same content in different format. You shouldn't need to meddle with this. Why are you converting to JSON in the first place? What happens if you ask the drivers for a fully hydrated POCO/POJO that contains a `byte[]`? — mnemosyn, Jun 16 '13 at 11:18

score 0 · Answer 2 · answered Jun 06 '13 at 22:29

This is a bug in the Java driver.

Bernie Hackett

That bug report was posted by the OP. – interjay Jun 06 '13 at 22:54