
I'm currently using the Veins library and simulation package to run some experiments. Because these have very long run times, I'm trying to use the university cluster servers (KITE 2.0/RHEL6.6/Lustre 2.5.29.ddnpf3). However, I've now encountered several different runtime bugs with the same code that runs perfectly fine on my local machine (Fedora 23). I'm looking for a way to debug this kind of problem easily. I suspect that the cause lies in the different gcc version, or perhaps in some other system-level library that I can't change remotely, but I'm not sure. I'm certain that the OMNeT++ version is the same; the Veins library is provided by me and is the same locally and remotely.

An example of the issues I've encountered is discussed here, which I eventually fixed like this. As far as I can tell, both versions have the same semantics: `DimensionSet` extends `std::set`, and `DimensionSet::timeFreqDomain` is a static const initialized with (`Dimension::time`, `Dimension::frequency`), as in the fix.

What is a good approach to look for the cause? Is there a simple way to "cross-compile" between these machines, or some way to diff the binaries to look for the cause? Where do I look for common ways to deal with problems like these?

Julian Heinovski
  • It sounds like a memory initialization issue, and the successful case is simply lucky. Perhaps a variable needs to be zero and happens to be on the "good" machine, but happens not to be on the failing machine. – donjuedo Mar 04 '16 at 19:23
  • In the issue I described, `Dimension::time` and `Dimension::freq` are initialized to the same values (at least, when running in GDB), and the set simply lacked the elements for some reason. I'd be more interested in approaches to spot these problems than identifying the source of this particular error, but thanks for the feedback :-). – Rens van der Heijden Mar 05 '16 at 17:42
  • Turns out there is, indeed, some kind of issue with memory! For some reason, `Dimension::time` and `Dimension::frequency` had the same "unique" identifiers. I'm still tracking down why exactly this happens, because there is code that grabs an unused ID. It could be that the issue is related to parallel access to that function (although I assumed everything to be thread-safe...). – Rens van der Heijden Mar 06 '16 at 18:23

1 Answer

I might have tracked the error down to an example of a static initialization order fiasco: MiXiM's Dimension::time is a static member, so it should not have been used to initialize other static members. Unfortunately, this is exactly what MiXiM (and, by extension, Veins) did, leading to such crashes.

I have pushed commit 7807f47c (part of Veins 4.4), which gets rid of almost all static members, so that the whole of the framework should be safer to use.
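
For readers who hit the same class of bug, here is a minimal sketch of one common way to avoid the static initialization order problem, the construct-on-first-use idiom. It is not the code from the commit above. The `Dimension` class, its ID counter, and the `timeFreqDomain()` function are simplified, hypothetical stand-ins for the MiXiM/Veins types, written only to illustrate the technique.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for MiXiM's Dimension: every instance grabs the
// next unused ID when it is constructed.
class Dimension {
public:
    Dimension() : id_(nextId()) {}

    int id() const { return id_; }
    bool operator<(const Dimension& other) const { return id_ < other.id_; }

    // Construct-on-first-use: a function-local static is initialized the
    // first time control reaches its declaration, so callers always see a
    // fully constructed object, regardless of translation-unit order.
    static const Dimension& time() {
        static const Dimension instance;
        return instance;
    }

    static const Dimension& frequency() {
        static const Dimension instance;
        return instance;
    }

private:
    // The counter is also a function-local static, so it is guaranteed to
    // be initialized before the first ID is handed out.
    static int nextId() {
        static int counter = 0;
        return counter++;
    }

    int id_;
};

// Hypothetical stand-in for DimensionSet::timeFreqDomain. Because it calls
// the accessors above, both dimensions exist before the set is filled.
inline const std::set<Dimension>& timeFreqDomain() {
    static const std::set<Dimension> domain{Dimension::time(),
                                            Dimension::frequency()};
    return domain;
}
```

With plain static data members defined in different translation units, the set could be built before the two dimensions were constructed, and the outcome would depend on the compiler and linker. That would fit a bug that shows up on one machine only. With the accessors, `timeFreqDomain().size()` is 2 and the two IDs differ no matter which file is initialized first.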

Christoph Sommer