15

Pointers cannot be persisted directly to file, because they point to absolute addresses. To address this issue I wrote a relative_ptr template that holds an offset instead of an absolute address.

Based on the fact that only trivially copyable types can be safely copied bit-by-bit, I made the assumption that this type needed to be trivially copyable to be safely persisted in a memory-mapped file and retrieved later on.

This restriction turned out to be a bit problematic, because the compiler-generated copy constructor does not behave in a meaningful way. I found nothing that forbade me from defaulting the copy constructor and making it private, so I made it private to avoid accidental copies that would lead to undefined behaviour.

Later on, I found boost::interprocess::offset_ptr whose creation was driven by the same needs. However, it turns out that offset_ptr is not trivially copyable because it implements its own custom copy constructor.
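To illustrate why a self-relative pointer ends up with a custom copy constructor, here is a minimal sketch. It is not Boost's implementation and not my actual relative_ptr; the names `relative_ptr`, `node` and `survives_relocation` are made up for this example, and reserving offset 0 for null is an assumption of the sketch. The offset is measured from the pointer's own address, so a copy that lives somewhere else has to recompute it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Hypothetical self-relative pointer: stores the distance in bytes from its
// own address to the target instead of the target's absolute address.
template <typename T>
class relative_ptr {
public:
    relative_ptr() : offset_(0) {}
    explicit relative_ptr(T* target) { set(target); }

    // A copy lives at a different address than its source, so the offset
    // must be recomputed. This is what makes the type non-trivially copyable.
    relative_ptr(const relative_ptr& other) { set(other.get()); }
    relative_ptr& operator=(const relative_ptr& other) {
        set(other.get());
        return *this;
    }

    T* get() const {
        if (offset_ == 0) return nullptr;  // assumption: offset 0 means null
        return reinterpret_cast<T*>(
            reinterpret_cast<std::intptr_t>(this) + offset_);
    }

private:
    void set(T* target) {
        offset_ = target ? reinterpret_cast<std::intptr_t>(target) -
                               reinterpret_cast<std::intptr_t>(this)
                         : 0;
    }
    std::ptrdiff_t offset_;
};

// Pointer and target sit in the same block, as they would in one mapping.
struct node {
    int value;
    relative_ptr<int> link;
};

// Simulates mapping the same bytes at a different base address.
bool survives_relocation() {
    alignas(node) unsigned char first[sizeof(node)];
    alignas(node) unsigned char second[sizeof(node)];
    node* a = new (first) node;
    a->value = 42;
    a->link = relative_ptr<int>(&a->value);
    std::memcpy(second, first, sizeof(node));  // whole block moves together
    node* b = reinterpret_cast<node*>(second);
    return b->link.get() == &b->value && *b->link.get() == 42;
}
```

The bitwise move of the whole block keeps the pointer valid because pointer and target move together, which is the memory-mapped file case. Copying the pointer alone, on the other hand, needs the custom constructor. Strictly speaking the `memcpy` of a non-trivially-copyable `node` is outside what the standard guarantees, which is exactly what the question is about.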

Is my assumption that the smart pointer needs to be trivially copyable to be persisted safely wrong?

If there's no such restriction, I wonder if I can safely do the following as well. If not, exactly what are the requirements a type must fulfill to be usable in the scenario I described above?

#include <iostream>
#include <new>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

struct base {
    int x;
    virtual void f() = 0;
    virtual ~base() {} // virtual members!
};

struct derived : virtual base {
    int x;
    void f() { std::cout << x; }
};

using namespace boost::interprocess;

void persist() {
    file_mapping file("blah", read_write);
    mapped_region region(file, read_write, 128, sizeof(derived));
    // create object on a memory-mapped file
    derived* d = new (region.get_address()) derived();
    d->x = 42;
    d->f();
    region.flush();
}

void retrieve() {
    file_mapping file("blah", read_write);
    mapped_region region(file, read_write, 128, sizeof(derived));
    // treat the mapped bytes as the object stored by persist()
    derived* d = static_cast<derived*>(region.get_address());
    d->f();
}

int main() {
    persist();
    retrieve();
}

Thanks to all those that provided alternatives. It's unlikely that I will be using something else any time soon, because as I explained, I already have a working solution. And as you can see from the use of question marks above, I'm really interested in knowing why Boost can get away without a trivially copyable type, and how far you can go with it: it's quite obvious that classes with virtual members will not work, but where do you draw the line?

Jonas
R. Martinho Fernandes

6 Answers

8

To avoid confusion let me restate the problem.

You want to create an object in mapped memory in such a way that, after the application is closed and reopened, the file can be mapped once again and the object used without further deserialization.

POD is kind of a red herring for what you are trying to do. You don't need to be binary copyable (what POD means); you need to be address-independent.

Address-independence requires you to:

  • avoid all absolute pointers.
  • only use offset pointers to addresses within the mapped memory.

There are a few corollaries that follow from these rules.

  • You can't use virtual anything. C++ virtual functions are implemented with a hidden vtable pointer in the class instance. The vtable pointer is an absolute pointer over which you don't have any control.
  • You need to be very careful about the other C++ objects your address-independent objects use. Basically anything in the standard library may break if you use it. Even if a type doesn't use new, it may use virtual functions internally, or just store an absolute address.
  • You can't store references in the address-independent objects. Reference members are just syntactic sugar over absolute pointers.

Inheritance is still possible but of limited usefulness since virtual is outlawed.

Any and all constructors / destructors are fine as long as the above rules are followed.
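Part of these rules can be checked at compile time. The following is only a sketch: the trait name `is_mappable_candidate` and the two test structs are invented for this example. It rejects polymorphic types and bare pointer or reference types, but it cannot look inside a struct for pointer or reference members, so passing the check is necessary, not sufficient.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical trait: true when T has no hidden vtable pointer and is not
// itself a raw pointer or a reference. It does NOT inspect T's members.
template <typename T>
struct is_mappable_candidate
    : std::integral_constant<bool, !std::is_polymorphic<T>::value &&
                                       !std::is_pointer<T>::value &&
                                       !std::is_reference<T>::value> {};

struct plain_record {
    int id;
    double weight;
};

struct with_virtual {
    virtual void f() {}
};

static_assert(is_mappable_candidate<plain_record>::value,
              "plain data is a candidate for mapped storage");
static_assert(!is_mappable_candidate<with_virtual>::value,
              "virtual functions imply a hidden absolute pointer");
static_assert(!is_mappable_candidate<int*>::value,
              "raw pointers are absolute addresses");
```

A check like this could sit next to the placement-new call so that adding a virtual function to a persisted type fails the build instead of crashing on the next run.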

Even Boost.Interprocess isn't a perfect fit for what you're trying to do. Boost.Interprocess also needs to manage shared access to the objects, whereas you can assume that you're the only one messing with the memory.

In the end it may be simpler / saner to just use Google Protobufs and conventional serialization.

Joseph Nields
deft_code
  • You say it doesn't "need to be binary copyable". Does that mean that I can use mmaped files to cheat around not being able to use memcpy? http://ideone.com/TK4SF – R. Martinho Fernandes Sep 08 '11 at 18:39
  • Your example does not make a copy. It aliases the object. If you change `x.foo` it will also change `y.foo`. They are the same object even though they have different addresses due to the mmap. – deft_code Sep 08 '11 at 20:30
  • Technically they're not the same object. The identity of a C++ object is defined by its address. Likely in all implementations of mmapped files changing one will change the other, because both regions will map to the same underlying page. However, were that not to happen, I would indeed be cheating thanks to implementation specifics, I think. Anyway, I think I got my answer now: there are no POD-like requirements, only the address independence requirements I was already handling :) – R. Martinho Fernandes Sep 08 '11 at 23:34
  • 1
    No, they are the same object. They may have different addresses, but the example used a shared memory mapping. Because of that, they must be aliases of each other; changes in x must show in y. No implementation can legitimately do otherwise. Additionally, you must declare foo as volatile, otherwise the compiler may make optimizations that don't take into account the aliasing. Lastly, although not_trivially_copyable has a constructor, it is nevertheless trivially copyable. The idea of trivially copyable has to do with whether a bitwise copy of the object will work correctly. – James Caccese Dec 30 '11 at 19:56
  • For more on trivially copyable, see http://groups.google.com/group/comp.lang.c++.moderated/msg/bbfc2a3d9b1665f3 – James Caccese Dec 30 '11 at 19:57
4

Yes, but for reasons other than the ones that seem to concern you.

You've got virtual functions and a virtual base class. These lead to a host of pointers created behind your back by the compiler. You can't turn them into offsets or anything else.

If you want to do this style of persistence, you need to eschew 'virtual'. After that, it's all a matter of the semantics. Really, just pretend you were doing this in C.

bmargulies
2

Even POD has pitfalls if you are interested in interoperating across different systems or across time.

You might look at Google Protocol Buffers for a way to do this in a portable fashion.

Steve Dispensa
  • Thanks. But in this case, the file will never leave this machine, so, that's not an issue. – R. Martinho Fernandes Sep 04 '11 at 19:17
  • 1
    Will you ever upgrade the machine? Will you ever switch compilers? Will you ever change SDKs in such a way that your structure packing changes? Compiler flags? Lots can go wrong here. – Steve Dispensa Sep 04 '11 at 19:32
  • Don't worry, the file is local, temporary, and not shared. And it's not a problem to make future versions incompatible. – R. Martinho Fernandes Sep 04 '11 at 19:37
  • @R. Martinho Fernandes, you should mention this in your original question. – DuckMaestro Sep 04 '11 at 20:12
  • 1
    @Duck: perhaps, but it's not relevant to the question. He asked what the requirements are for persisting an object to a memory-mapped file. Portability and compiler upgrades and protocol buffers had nothing to do with it. – jalf Sep 04 '11 at 20:33
  • 1
    Not to sound like sour grapes, but a downvote? He didn't specify that the file would never be moved, and the implicit assumption in the question is clearly that POD is safe, which in the cases I mentioned is not true. I provided an alternative to POD that *is* as safe as the question assumes. "Do I need to make a type POD..." implies that the questioner is missing an important point, at least without the clarification that DuckMaestro asked for. – Steve Dispensa Sep 05 '11 at 02:34
2

Not as much an answer as a comment that grew too big:

I think it's going to depend on how much safety you're willing to trade for speed/ease of usage. In the case where you have a struct like this:

struct S { char c; double d; };

You have to consider padding and the fact that some architectures might not allow you to access a double unless it is aligned on a proper memory address. Adding accessor functions and fixing the padding tackles this and the structure is still memcpy-able, but now we're entering territory where we're not really gaining much of a benefit from using a memory mapped file.
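To see the padding in question on a given toolchain, a small sketch like the following can help. The helper names `padding_bytes_in_S` and `offset_of_d` are made up here, and the actual numbers depend on the compiler, the target and the packing flags, so none are claimed.

```cpp
#include <cassert>
#include <cstddef>

struct S { char c; double d; };

// Bytes in S that belong to neither member (padding inserted by the compiler).
inline std::size_t padding_bytes_in_S() {
    return sizeof(S) - (sizeof(char) + sizeof(double));
}

// Where d actually starts inside S; anything above sizeof(char) is padding
// placed before d to satisfy its alignment.
inline std::size_t offset_of_d() {
    return offsetof(S, d);
}
```

If the file layout must stay stable, these values are what a `static_assert` on `sizeof(S)` and `offsetof(S, d)` would pin down, so that a change of compiler flags breaks the build rather than the data.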

Since it seems like you'll only be using this locally and in a fixed setup, relaxing the requirements a little seems OK, so we're back to using the above struct normally. Now does the type have to be trivially copyable? I don't necessarily think so; consider this (probably broken) class:

#include <cstddef>
#include <iostream>
#include <utility>

enum Endian { LittleEndian, BigEndian };
template<typename T, Endian e> struct PV {
        union {
                unsigned char b[sizeof(T)];
                T x;
        } val;

        // converting assignment: copy the value, then swap bytes if the
        // source uses the other byte order
        template<Endian oe> PV& operator=(const PV<T,oe>& rhs) {
                val.x = rhs.val.x;
                if (e != oe) {
                        for(std::size_t b = 0; b < sizeof(T) / 2; b++) {
                                std::swap(val.b[sizeof(T)-1-b], val.b[b]);
                        }
                }
                return *this;
        }
};

It's not trivially copyable and you can't just use memcpy to move it around in general, but I don't see anything immediately wrong with using a class like this in the context of a memory mapped file (especially not if the file matches the native byte order).

Update:
Where do you draw the line?

I think a decent rule of thumb is: if the equivalent C code is acceptable and C++ is just being used as a convenience, to enforce type-safety, or to enforce proper access, it should be fine.

That would make boost::interprocess::offset_ptr OK since it's just a helpful wrapper around a ptrdiff_t with special semantic rules. In the same vein struct PV above would be OK as it's just meant to byte swap automatically, though like in C you have to be careful to keep track of the byte order and assume that the structure can be trivially copied. Virtual functions wouldn't be OK as the C equivalent, function pointers in the structure, wouldn't work. However something like the following (untested) code would again be OK:

struct Foo { 
    unsigned char obj_type;
    void vfunc1(int arg0) { vtables[obj_type].vfunc1(this, arg0); }
};
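A slightly fuller version of the same idea, still only a sketch: the `FooOps` struct, the `payload` member, the two implementations and the `int` return type are assumptions added here so the dispatch can be observed. The point is that only the small `obj_type` tag is stored in the mapped memory, while the table of function pointers lives in the program and is rebuilt on every run.

```cpp
#include <cassert>

struct Foo;

// Hypothetical per-type operations; one entry per value of obj_type.
struct FooOps {
    int (*vfunc1)(const Foo*, int);
};

struct Foo {
    unsigned char obj_type;  // index into vtables, safe to persist
    int payload;             // assumed data member, for illustration only
    int vfunc1(int arg0) const;
};

namespace {
int add_impl(const Foo* self, int arg0) { return self->payload + arg0; }
int mul_impl(const Foo* self, int arg0) { return self->payload * arg0; }

// Lives in the executable, never in the mapped file, so the absolute
// function addresses are always valid for the current process.
const FooOps vtables[] = { { add_impl }, { mul_impl } };
}  // namespace

inline int Foo::vfunc1(int arg0) const {
    return vtables[obj_type].vfunc1(this, arg0);
}
```

A real version would want to validate `obj_type` against the table size before indexing, since the tag comes from a file.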
user786653
  • Actually, your example is still trivially copyable. The template does not prevent the compiler from generating the default copy assignment operator. In fact, the default operator will always win overload resolution. http://www.ideone.com/qMdED. But I get your point. Also, thanks for answering the questions I *posed* :) – R. Martinho Fernandes Sep 05 '11 at 09:37
  • Oh, I almost forgot (and off-topic): don't tell anyone, even if it works pretty much everywhere, type punning with a union is undefined behaviour ;) – R. Martinho Fernandes Sep 05 '11 at 10:32
  • It was just meant as a quick example not meant for production code, imagine it has a copy constructor implemented in terms of the assignment operator :) The point I was trying to make is that in most cases where you'd be fine with having special rules governing the access of the mmaped value, using C++ to the fullest to express the intent is OK, and IMO preferable, to the error-prone approach of trying to enforce conventions (e.g. "this variable must be accessed with `get/set_uint32_le`"). Hope my argument makes sense. – user786653 Sep 05 '11 at 10:41
  • Yes, your answer matches my intuition as well. I won't accept it yet, because I'm hoping someone can provide an authoritative answer, and I'll probably place a bounty on it, but thanks again :) – R. Martinho Fernandes Sep 05 '11 at 10:48
1

That is not going to work. Your class derived is not a POD, therefore it depends on the compiler how it compiles your code. In other words, do not do it.

By the way, where are you releasing your objects? I see you are creating your objects in place, but you are not calling the destructor.

BЈовић
  • 1
    Boost's `offset_ptr` is not a POD either. What gives? (I deliberately ignored cleanup in this example. I can skip calling the destructor if there are no resources to cleanup, can't I?) – R. Martinho Fernandes Sep 04 '11 at 18:14
  • 1
    Cleanup is a minor nit. On the other hand, we can talk all night about UB (that is what the above example is). Most likely it will work, but it is not 100% guaranteed. – BЈовић Sep 04 '11 at 18:32
  • 3
    Ok, it won't work, but not because it isn't POD. I assume that using `offset_ptr` in Boost does not lead to UB, because, er, well, it's in Boost. `offset_ptr` is not a POD, so being a POD is not a requirement here. I'm interested in knowing what **is a requirement for this**. – R. Martinho Fernandes Sep 04 '11 at 18:35
1

Absolutely not. Serialisation is a well established functionality that is used in numerous situations, and certainly does not require PODs. What it does require is that you specify a well defined serialisation binary interface (SBI).

Serialisation is needed anytime your objects leave the runtime environment, including shared memory, pipes, sockets, files, and many other persistence and communication mechanisms.

Where PODs help is where you know you are not leaving the processor architecture. If you will never be changing versions between writers of the object (serialisers) and readers (deserialisers) and you have no need for dynamically-sized data, then PODs allow easy memcpy based serialisers.

Commonly, though, you need to store things like strings. Then, you need a way to store and retrieve the dynamic information. Sometimes, 0 terminated strings are used, but that is pretty specific to strings, and doesn't work for vectors, maps, arrays, lists, etc. You will often see strings and other dynamic elements serialized as [size][element 1][element 2]… this is the Pascal array format. Additionally, when dealing with cross machine communications, the SBI must define integral formats to deal with potential endianness issues.
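As a rough sketch of the [size][element 1][element 2]… layout for a string, with a fixed little-endian 32-bit length so the format does not depend on the writer's byte order. The function names `write_string` and `read_string` and the choice of a 32-bit length are assumptions of this example, not part of any particular SBI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends [32-bit little-endian length][bytes] to out.
// Assumes the string is shorter than 4 GiB.
void write_string(std::vector<unsigned char>& out, const std::string& s) {
    const std::uint32_t n = static_cast<std::uint32_t>(s.size());
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<unsigned char>((n >> shift) & 0xFF));
    out.insert(out.end(), s.begin(), s.end());
}

// Reads one length-prefixed string starting at pos and advances pos.
// Returns false, leaving pos untouched, if the buffer is too short.
bool read_string(const std::vector<unsigned char>& in, std::size_t& pos,
                 std::string& out) {
    if (pos > in.size() || in.size() - pos < 4) return false;
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    if (in.size() - pos - 4 < n) return false;
    out.assign(in.begin() + pos + 4, in.begin() + pos + 4 + n);
    pos += 4 + static_cast<std::size_t>(n);
    return true;
}
```

The same pattern extends to vectors and other containers by writing the element count first and then each element in turn.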

Now, pointers are usually implemented by IDs, not offsets. Each object that needs to be serialised can be given an incrementing number as an ID, and that can be the first field in the SBI. The reason you usually don't use offsets is because you may not be able to easily calculate future offsets without going through a sizing step or a second pass. IDs can be calculated inside the serialisation routine on first pass.

Additional ways to serialize include text based serialisers using some syntax like XML or JSON. These are parsed using standard textual tools that are used to reconstruct the object. These keep the SBI simple at the cost of pessimising performance and bandwidth.

In the end, you typically build an architecture where you build serialisation streams that take your objects and translate them member by member to the format of your SBI. In the case of shared memory, it typically pushes the members directly on to the memory after acquiring the shared mutex.

This often looks like

void MyClass::Serialise(SerialisationStream & stream)
{
  stream & member1;
  stream & member2;
  stream & member3;
  // ...
}

where the & operator is overloaded for your different types. You may take a look at Boost.Serialization for more examples.
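For concreteness, a write-only stream with that shape might look like the sketch below. This is not the Boost.Serialization API: `SerialisationStream`, its `bytes()` accessor, the two supported member types and the little-endian length prefix are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical write-only stream: operator& appends each member to a byte
// buffer in a fixed format.
class SerialisationStream {
public:
    // 32-bit integers are written little-endian regardless of the host.
    SerialisationStream& operator&(std::uint32_t v) {
        for (int shift = 0; shift < 32; shift += 8)
            bytes_.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
        return *this;
    }

    // Strings are written as [length][characters].
    SerialisationStream& operator&(const std::string& s) {
        *this & static_cast<std::uint32_t>(s.size());
        bytes_.insert(bytes_.end(), s.begin(), s.end());
        return *this;
    }

    const std::vector<unsigned char>& bytes() const { return bytes_; }

private:
    std::vector<unsigned char> bytes_;
};

struct MyClass {
    std::uint32_t member1;
    std::string member2;

    // Member-by-member translation into the stream's format.
    void Serialise(SerialisationStream& stream) const {
        stream & member1;
        stream & member2;
    }
};
```

A reading stream would overload the same operator on non-const references, which is what lets a single Serialise function serve both directions in libraries that follow this style.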

ex0du5
  • 3
    He isn't talking about serialization. – GManNickG Sep 04 '11 at 20:34
  • He most certainly is. I just reread the question to make sure that I wrote a relevant response, and it is fully applicable. Can you explain why you think otherwise? – ex0du5 Sep 04 '11 at 20:46
  • @GMan: He even mentions the boost solution for interprocess serialisation. I even pointed out why the offset solution was less flexible when the data becomes dynamic, and gave a solution that didn't have the poor performance or need for multipass. I have used shared memory in precisely this way for system communication frameworks, and gave a relevant response based on my experience. – ex0du5 Sep 04 '11 at 20:55
  • 1
    The word "serialization" only appears in your answer, and the title of the question is "Do I need to make a type a POD to persist it with a memory-mapped file?" Memory-mapping files is not synonymous with saving files, it's a unique subset. – GManNickG Sep 04 '11 at 20:58
  • @GMan: I even pointed out that serialisation is the appropriate solution for persistence and communication. I apologise for using a word that was not used in the original question, and I'm sorry it has caused you trauma. I am trying to provide the correct answer in software design and did not think I'd be charged with the horrible crime of using new words. – ex0du5 Sep 04 '11 at 21:09
  • @GMan: You continue to claim my answer has nothing to do with his question. I keep pointing out how I answered the question, how I provided a better solution than offsets, why my answer is relevant, ... Why do you think it is incorrect to point out you don't need PODs if you serialise your object to the memory mapped file instead of simply memcpying? How is that not a direct answer? What do you actually think serialisation is about? If you answer some of these questions, I might be able to figure out why you are throwing this fit. – ex0du5 Sep 04 '11 at 21:58
  • "Fit"...I don't think this means what you think it means. – GManNickG Sep 04 '11 at 22:05
  • 1
    @GMan: I'm not convinced you know what serialisation is used for. I think, from your repeated use of stressing that he is using memory-mapped files, that you must imagine serialisation is about disk files only. Now, since I mentioned that it is used in a number of technologies, including shared memory, in my response, I also don't think you read my response. So I think you may be the type that argues without the intellectual curiosity to be able to accept new information. So you throw fits instead. – ex0du5 Sep 04 '11 at 22:27
  • @GMan The question says in the very first sentence: "persisted directly to file", which is, by definition, serialization. Don't get wrapped up in the term "serialization", the meaning (in this context) is the same. Please refer to Wikipedia for a very succinct definition: [http://en.wikipedia.org/wiki/Serialization] – Kevin Williams Sep 06 '11 at 20:36
  • 2
    @Kevin the question also includes, *somewhere in the vicinity of question marks*, mention of requirements necessary to work in the scenario described. That's what I'm interested in. I already have a working solution, which I know is safe. I want to know how far could I go and still remain safe. Sorry if that wasn't clear from the original text and the comments I posted all over. – R. Martinho Fernandes Sep 06 '11 at 23:36
  • 2
    @R. Martinho Fernandes: I really don't understand all these objections. My response certainly applies _in_the_scenario_described_. In fact, I have used it precisely for a data store in shared memory for an OS extension I wrote years ago that injected DLLs into every process to intercept system APIs. The functions "persist" and "retrieve" in fact are very similar to mine, except that my data structures were serialised instead of bit-reinterpreted. The functional requirements are met, and what I presented is more powerful, explicitly extending the capabilities as requested. – ex0du5 Sep 07 '11 at 00:25
  • 1
    @R. Martinho Fernandes: When people come on SO and ask how to use a char array without opening up any chance of buffer overrun, it is often seen as quite acceptable to mention that using std::string does this automatically by managing the buffer for you. It usually is immaterial if the questioner already has a working copy function for the primary case. Offering a solution that scales better than the original suggestion and focusing on the functional requirements over an assumed implementation is one of the well-accepted means of solution on SO. These objections are baffling. – ex0du5 Sep 07 '11 at 00:30
  • 2
    @R. Martinho Fernandes: Here is a full implementation, if you are concerned about the time to switch over what you have. http://bit.ly/oqWFRH – ex0du5 Sep 07 '11 at 00:40
  • 3
    I'm not objecting to what you suggest here. In fact I may turn to a true serialisation scheme if I end up needing anything more complex than what I need right now (which is: a bunch of raw uninterpreted bytes, a bitfield, and one of those relative pointers). My reluctance to change comes from the current code being rather simple and lightweight. FWIW, I didn't downvote any of the answers presented here. Maybe I shouldn't have provided so much background to the question and stayed more focused on `boost::interprocess::offset_ptr`. – R. Martinho Fernandes Sep 07 '11 at 01:00