Preferred (or recommended) way to store large amounts of simulation configurations, run values and final results

Question

I am working with a network simulator. After extending it, I need to run many different simulations and tests, and I need to record:

simulation scenario configurations
values of some parameters (e.g. buffer sizes, signal qualities, position) per devices per time unit t
final results computed from those recorded values

The second kind of data is needed for visualization after a simulation has finished (a simple animation, showing some statistics over time).

I am using Python with matplotlib etc. for post-processing the data and for writing a proper app (I am currently considering PyQt or Django, but that is not the topic of this question). What would be the best way to store this data?

My first guess was to use XML files, but the XML syntax may add too much overhead (the files can grow very large, especially for the second kind of data). So I tried to design a database... but this does not seem like the proper way to me either... Maybe a mix of both?

I have tried to find some clues on Google, but found nothing special. Have you ever needed to store such data? How did you do it? Is there any "design pattern" for that?

"database" can have quite a few different meanings... I think the questions here are: 1/ what your data looks like (structures, types etc) 2/ how do you plan on using your data (queries, concurrent access, etc) 3/ are your data critical ? (what happens if they are lost, corrupt, whatever ?) 4/ and of course how much data you will have... — bruno desthuilliers, Jun 28 '12 at 11:05
1. Currently I have simple simulation text log files, which I parse into simple collections, e.g. the mobile terminal positions are an array (Terminals x TimeSlots) of (x, y) tuples. I then usually save the parsed data to multiple CSV files. In the planned app, however, I am considering packing this data into classes, which should simplify development of the visualization and GUI parts. — Kokos, Jun 28 '12 at 11:32
2. There will not be any concurrent access; I expect only one person to use a given data set at a time. My first draft vision is simple: select a simulation, show the results, play the mobility visualization, etc., but I want to keep it flexible for further extensions. 3. I don't think so. 4. From tens of MB to a few GB (I hope no more than 3 ;)) per simulation run. — Kokos, Jun 28 '12 at 11:38
I'm not writing an answer because 'use a database' has already been written, but if you **were** to go for a file format, I'd use JSON over XML as it's much less verbose, resulting in smaller file sizes. But yeah, use a database. — Josh Smeaton, Jun 30 '12 at 08:25

score 5 · Accepted Answer · answered Jun 30 '12 at 07:55

Separate concerns:

Apart from pondering on the technology to use for storing data (DBMS, CSV, or maybe one of the specific formats for scientific data), note that you have three very different kinds of data to manage:

Simulation scenario configurations: these are (typically) rather small, but they need to be simple to edit, simple to re-use, and should let you reproduce a simulation run. Here, text or code files seem to be a good choice (these should also be version-controlled).
Raw simulation data: this is where you should be really careful if you are concerned with simulation performance, because writing 3 GB of data during a run can take a huge amount of time if implemented badly. One way to proceed would be to use existing file formats for this purpose (see below) and see if they work for you. If not, you can still use a DBMS. Also, it is usually a good idea to include a description of the scenario that generated the data (or at least a reference), as this helps you manage the results.
Data for post-processing: how to store this mostly depends on the post-processing tools. For example, if you already have a class structure for your visualization application, you could define a file format that makes it easy to read in the required data.
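As a rough illustration of keeping these concerns apart, here is a minimal sketch using only the Python standard library. It assumes the configuration fits in a plain dict and the raw data is a stream of position samples; the function names (`config_id`, `save_run`), the file naming scheme and the column names are made up for this example and are not taken from any simulator.

```python
import csv
import hashlib
import json
from pathlib import Path


def config_id(config: dict) -> str:
    """Return a short, stable identifier derived from the config contents."""
    # Sorting keys makes the id independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def save_run(out_dir: Path, config: dict, samples) -> Path:
    """Write the config as JSON and the raw samples as CSV, linked by a shared id.

    `samples` is an iterable of (time_slot, terminal_id, x, y) tuples
    (a hypothetical layout chosen for this sketch).
    """
    run_id = config_id(config)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Scenario configuration: small, human-editable, easy to version-control.
    config_path = out_dir / f"{run_id}.config.json"
    config_path.write_text(
        json.dumps(config, indent=2, sort_keys=True), encoding="utf-8"
    )

    # Raw data: written row by row, so a generator can be passed in and the
    # whole run never has to sit in memory at once.
    data_path = out_dir / f"{run_id}.samples.csv"
    with data_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["time_slot", "terminal_id", "x", "y"])
        writer.writerows(samples)
    return data_path
```

Because both files share the same id prefix, the raw data always carries a reference to the scenario that produced it. For multi-GB runs a binary scientific format would probably be a better fit than CSV; the separation of the three kinds of data stays the same.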

Look for existing solutions:

The problem you face (How to manage simulation data?) is fundamental and there are many potential solutions, each coming with certain trade-offs. As you are working in network simulation, check out what capabilities other tools used in your community provide. It could be that their developers ran into problems you are not even anticipating yet (regarding reproducibility etc.), and already found a good solution. For example, you could check out how OMNeT++ is handling simulation output: the simulation configurations are defined in a separate file, results are written to vec and sca files (depending on their nature). As far as I understood your problems with hierarchical data, this is supported as well (vectors get unique IDs and are associated with an attribute of some model entity). Additional tools already work with these file formats, e.g. to convert them to other formats like CSV/MATLAB files, so you could even think of creating files in the same format (documented here) and to use existing tools/converters for post-processing.

Many other simulation tools will have similar features, so take a look at what would work best for you.

thank you for this very comprehensive explanation. Now I have a lot of work to do and to learn ;) — Kokos, Jul 03 '12 at 11:37

score 1 · Answer 2 · answered Jun 28 '12 at 11:57

It sounds like you need to record more or less the same kinds of information for each case, so a relational database sounds like a good fit. Why do you think it's "not the proper way"?

If your data fits in a collection of CSV files, you're most of the way to a relational database already! Just store the data in database tables instead, and you have support for foreign keys and queries. If you go on to implement an object-oriented solution, you can initialize your objects from the database.
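To make the step from CSV files to tables concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`runs`, `positions`, `time_slot`, ...) and the helper functions are assumptions made for this example; a real schema would follow your own data.

```python
import csv
import io
import sqlite3

# One row per simulation run, and many position rows pointing back to it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id   INTEGER PRIMARY KEY,
    scenario TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS positions (
    run_id      INTEGER NOT NULL REFERENCES runs(run_id),
    time_slot   INTEGER NOT NULL,
    terminal_id INTEGER NOT NULL,
    x           REAL NOT NULL,
    y           REAL NOT NULL,
    PRIMARY KEY (run_id, time_slot, terminal_id)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def import_positions(conn: sqlite3.Connection, scenario: str, csv_file) -> int:
    """Insert one run plus its position rows from a CSV file object; return run_id."""
    reader = csv.DictReader(csv_file)
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute("INSERT INTO runs (scenario) VALUES (?)", (scenario,))
        run_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO positions VALUES (?, ?, ?, ?, ?)",
            (
                (run_id, int(r["time_slot"]), int(r["terminal_id"]),
                 float(r["x"]), float(r["y"]))
                for r in reader
            ),
        )
    return run_id


# Usage: load a tiny CSV and query the positions of one run.
conn = open_db()
csv_text = "time_slot,terminal_id,x,y\n0,1,0.0,1.5\n1,1,0.5,1.5\n"
run_id = import_positions(conn, "baseline", io.StringIO(csv_text))
rows = conn.execute(
    "SELECT time_slot, x, y FROM positions WHERE run_id = ? ORDER BY time_slot",
    (run_id,),
).fetchall()
```

Wrapping all inserts for a run in a single transaction also tends to matter for speed when a run produces many rows.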

alexis

Because I also have some more complex (hierarchical) data that I haven't processed yet. E.g. I have vectors of double values per terminal per time unit (there can be more than one correlated vector), which can also have related, variable parameters (e.g. vector values can be grouped by a few criteria that also vary in time, and I need to easily compare those groups and their sizes). When I started thinking about that, I lost confidence about which technology to use and began to think it is so complicated that it needs a special solution... – Kokos Jun 28 '12 at 12:24
If it's just data in a many-to-one relationship to your test runs, all you need is a second table and a foreign key. It's a question of *simple* relational design. (You haven't said how much you know about databases, so if you know what I'm talking about and this is *not* the case, please explain the actual problem in your question.) – alexis Jun 28 '12 at 14:14

score 1 · Answer 3 · answered Jun 28 '12 at 12:15

If your data structures are well-known and stable AND you need some of the SQL querying / computation features then a light-weight relational DB like SQLite might be the way to go (just make sure it can handle your eventual 3+GB data).

Otherwise, i.e. if each simulation scenario might need a dedicated data structure to store its results and you don't need any SQL features, you might be better off with a more free-form solution (document-oriented database, OO database, filesystem + CSV, whatever).

Note that you can still use a SQL db in the second case, but you'll have to dynamically create tables for each resultset, and of course dynamically create the relevant SQL queries too.
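A minimal sketch of what "dynamically create tables" could look like with `sqlite3` follows. The helper `store_resultset`, the type mapping and the example table name are hypothetical. Because table and column names cannot be passed as `?` parameters, the sketch validates them against a strict pattern before building the SQL string.

```python
import re
import sqlite3

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}


def _check_ident(name: str) -> str:
    # Identifiers cannot be bound as parameters, so reject anything unusual.
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def store_resultset(conn: sqlite3.Connection, table: str, columns: dict, rows) -> None:
    """Create `table` from a {column_name: python_type} mapping and insert `rows`."""
    table = _check_ident(table)
    col_defs = ", ".join(
        f'"{_check_ident(name)}" {_SQL_TYPES[py_type]}'
        for name, py_type in columns.items()
    )
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commit on success, roll back on error
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
        # Row values still go through normal parameter binding.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)


# Usage: a result set whose layout is only known at run time.
conn = sqlite3.connect(":memory:")
store_resultset(
    conn,
    "throughput_run1",  # hypothetical table name
    {"time_slot": int, "terminal_id": int, "mbps": float},
    [(0, 1, 12.5), (1, 1, 13.0)],
)
```

The cost of this flexibility is that every query also has to be assembled from the same column description, which is the trade-off mentioned above.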

bruno desthuilliers

Yes, my data structure is well-known (or will be ;) ) and stable. But what is bad about the DB solution is its lack of portability. I can export the data to CSV, though... hmm... something is just pushing me away from a DB ;) – Kokos Jun 28 '12 at 12:27
@Kokos Databases can be quite portable - especially if you're going to be using something like sqlite, which is a file-based single-user database. If you can store your data in XML, you can store it in a database. It also makes parsing your data later pretty convenient. – Josh Smeaton Jun 30 '12 at 08:24
