
I am trying to calculate the exact size of the protocol buffer objects.

I went through the following links: How do I determine the size of an object in Python? and https://goshippo.com/blog/measure-real-size-any-python-object/

However, protocol buffer objects do not expose __dict__ in dir(obj), presumably so that people cannot corrupt a message by manually adding attributes to it. This is my understanding, and it may be incomplete or incorrect.

So, I started with this protocol buffer message definition

syntax = "proto2";

package test;

message Inner {
  optional bytes inner_id = 1;
  optional string inner_name = 2;
  optional int64 inner_value = 3;
}

message Outer {
  optional bytes uuid = 1;
  optional string name = 2;
  enum Test {
    kOne = 1;
    kTwo = 2;
  }
  optional Test testing = 3;
  repeated Inner inner_list = 4;
}

This is a sample usage:

import uuid
from test_pb2 import Inner, Outer

x = Outer()
x.uuid = uuid.uuid4().bytes
x.name = "test"
x.testing = Outer.kOne
x.inner_list.add(inner_id=uuid.uuid4().bytes, inner_name="ok1", inner_value=1)
x.inner_list.add(inner_id=uuid.uuid4().bytes, inner_name="ok2", inner_value=2)
x.inner_list.add(inner_id=uuid.uuid4().bytes, inner_name="ok3", inner_value=3)

# print(...) with a single argument works on both Python 2 and Python 3.
print(id(x.inner_list))
print(id(x.inner_list[0].inner_id))
print(id(x.inner_list[1].inner_id))
print(id(x.inner_list[2].inner_id))
print(id(x.inner_list[0].inner_name))
print(id(x.inner_list[1].inner_name))
print(id(x.inner_list[2].inner_name))
print(id(x.inner_list[0].inner_value))
print(id(x.inner_list[1].inner_value))
print(id(x.inner_list[2].inner_value))

For each field (inner_id, inner_name and inner_value), the printed id is the same across all three list elements, even though the values belong to different elements and differ from each other.

As a result, my modification of the code from the link above did not work as expected:

import sys

from google.protobuf.descriptor import FieldDescriptor


def get_size(obj, seen=None):
    """Recursively finds size of objects"""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])
    else:
        try:
            for desc, _ in obj.ListFields():
                if desc.label == FieldDescriptor.LABEL_REPEATED:
                    size += sum([get_size(i, seen) for i in getattr(obj, desc.name)])
                else:
                    size += get_size(getattr(obj, desc.name), seen)
        except Exception:
            # Not a protobuf message (no ListFields); nothing more to add.
            pass
    return size

It tripped at the id check (obj_id in seen), so it did not account for the different memory requirements of "ok1" and "ok2", for example.

Could anyone please explain the reason for the identical ids, and how to correctly calculate the size of protocol buffer objects?

Thanks in advance.


1 Answer


I think this is simpler than you expected. Outer is a Message, and all messages have a ByteSize method, which returns the size of the serialized message in bytes.

from google.protobuf.message import Message
from test_pb2 import Outer

message: Message = Outer()
size_in_bytes = message.ByteSize()
print(f"Message is {size_in_bytes} bytes.")
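To make it concrete what ByteSize counts, here is a rough sketch that estimates the serialized size of the sample message from the question by hand, using the protobuf wire format rules for varint and length-delimited fields. The helper names (varint_size, key_size, bytes_field_size, varint_field_size) are hypothetical and not part of the protobuf library, and the sketch only handles non-negative integers. Note that this is the encoded size, not the memory the Python object occupies.

```python
def varint_size(n):
    """Bytes needed to encode a non-negative integer as a varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size


def key_size(field_number):
    # The field key is a varint of (field_number << 3 | wire_type).
    # The wire type only occupies the low 3 bits, so it does not change the size.
    return varint_size(field_number << 3)


def bytes_field_size(field_number, payload_len):
    # Length-delimited fields (bytes, string, embedded message):
    # key + length prefix + payload.
    return key_size(field_number) + varint_size(payload_len) + payload_len


def varint_field_size(field_number, value):
    # Varint fields (int64, enum) with a non-negative value: key + value.
    return key_size(field_number) + varint_size(value)


# One Inner message: 16-byte inner_id, a 3-character ASCII inner_name,
# and a small positive inner_value, as in the question's sample.
inner_size = (
    bytes_field_size(1, 16)
    + bytes_field_size(2, len("ok1".encode("utf-8")))
    + varint_field_size(3, 1)
)

# The Outer message: uuid, name, testing, and three embedded Inner messages.
outer_size = (
    bytes_field_size(1, 16)
    + bytes_field_size(2, len("test".encode("utf-8")))
    + varint_field_size(3, 1)  # Outer.kOne has the value 1
    + 3 * bytes_field_size(4, inner_size)
)

print(inner_size)  # 25
print(outer_size)  # 107
```

Under these assumptions, calling ByteSize() on the populated x from the question should report the same number as outer_size, and it should also match len(x.SerializeToString()).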

If you add type hints and use an IDE, it should suggest a few useful functions on message.
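As for the identical ids in the question, a likely explanation (an assumption about the C++-backed protobuf implementation, not something confirmed here) is that reading a scalar field builds a fresh Python object on every access. That temporary is freed as soon as id() returns, and CPython may hand the same address to the next temporary. Python only guarantees unique ids among objects whose lifetimes overlap. The sketch below shows the effect with a hypothetical stand-in class, Temp, instead of a real message.

```python
class Temp(object):
    """Hypothetical stand-in for a value created fresh on each field access."""


# Objects that are kept alive at the same time always have distinct ids.
kept = [Temp() for _ in range(3)]
kept_ids = [id(obj) for obj in kept]
print(len(set(kept_ids)))  # 3

# Each temporary here is discarded before the next one is created, so the
# interpreter is free to reuse the address. On CPython the ids often repeat,
# but that is an implementation detail and not guaranteed either way.
temp_ids = [id(Temp()) for _ in range(3)]
print(len(set(temp_ids)))
```

This is why a seen set keyed only on id(obj) can wrongly skip values in a recursive size function: the ids are only meaningful while the objects they came from are still referenced.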

Ben Butterworth