
I am trying to fill a dataset in an HDF5 file iteratively using HDFql. By 'iteratively' I mean that my simulator occasionally comes along with an update, and I want to dump some more data (contained in a std::vector) into my dataset. Strangely though, something breaks after a few 'iterations' and my dataset begins to just fill with zeros.

Luckily, this error also occurs in a minimal example and seems to be reproducible with the code below:

#include <cstdint>
#include <random>
#include <sstream>
#include <vector>
#include <HDFql.hpp>

int main (int argc, const char * argv[]) {
    HDFql::execute("CREATE TRUNCATE FILE /tmp/test_random.h5");
    HDFql::execute("USE FILE /tmp/test_random.h5");
    HDFql::execute("CREATE GROUP data");
    HDFql::execute("CREATE CHUNKED DATASET data/vals AS SMALLINT(UNLIMITED)");
    HDFql::execute("CLOSE FILE");
    std::stringstream ss;
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> dist_vals(0, 500);
    std::uniform_int_distribution<> dist_len(300, 1000);
    for(int i=0; i<500; i++)
    {
        const int num_values = dist_len(eng);
        std::vector<uint16_t> vals;
        for(int i=0; i<num_values; i++)
        {
            const int value = dist_vals(eng);
            vals.push_back(value);
        }
        HDFql::execute("USE FILE /tmp/test_random.h5");

        ss << "ALTER DIMENSION data/vals TO +" << vals.size();
        HDFql::execute(ss.str().c_str()); ss.str("");

        ss << "INSERT INTO data/vals(-" << vals.size() << ":1:1:" << vals.size() 
            << ") VALUES FROM MEMORY " 
            << HDFql::variableTransientRegister(vals.data());
        HDFql::execute(ss.str().c_str()); ss.str("");

        HDFql::execute("CLOSE FILE");
    }
}
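As an aside on the command-building pattern above: each HDFql statement is assembled in one reused std::stringstream, and `ss.str("")` must be called between statements or the commands concatenate. A minimal standalone sketch of that mechanic (no HDFql needed; the helper names are made up for illustration):

```cpp
#include <sstream>
#include <string>

// Build the ALTER DIMENSION command in a reused stringstream,
// clearing its contents first, the way the snippet above does.
std::string build_alter(std::stringstream &ss, int n) {
    ss.str("");  // reset contents before reuse
    ss << "ALTER DIMENSION data/vals TO +" << n;
    return ss.str();
}

// Same pattern for the INSERT command; `reg` stands in for the
// number returned by HDFql::variableTransientRegister.
std::string build_insert(std::stringstream &ss, int n, int reg) {
    ss.str("");
    ss << "INSERT INTO data/vals(-" << n << ":1:1:" << n
       << ") VALUES FROM MEMORY " << reg;
    return ss.str();
}
```

Without the `ss.str("")` reset, the second command would still contain the first one's text at its front.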

This code runs for 500 'iterations', appending a random amount of random data to the dataset each time. In my latest run, everything beyond data cell 4065 in the final output HDF5 file was just zeros.

So my question is: what am I doing wrong here? Many thanks!

Edit

On further experimentation, I have come to the conclusion that this is possibly a bug in HDFql. Looking at the following example:

#include <cstdint>
#include <iostream>
#include <random>
#include <sstream>
#include <vector>
#include <HDFql.hpp>

int main (int argc, const char * argv[]) {
    HDFql::execute("CREATE TRUNCATE FILE /tmp/test_random.h5");
    HDFql::execute("USE FILE /tmp/test_random.h5");
    HDFql::execute("CREATE CHUNKED DATASET data/vals AS SMALLINT(0 TO UNLIMITED)");

    std::stringstream ss;
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> dist_vals(0, 450);
    std::uniform_int_distribution<> dist_len(100, 300);
    int total_added = 0;

    for(int i=0; i<5000; i++)
    {
        const int num_values = 1024; //dist_len(eng);
        std::vector<uint16_t> vals;
        for(int j=0; j<num_values; j++)
        {
            const int value = dist_vals(eng);
            vals.push_back(value);
        }

        long long dim=0;
        ss << "SHOW DIMENSION data/vals INTO MEMORY " << HDFql::variableTransientRegister(&dim);
        HDFql::execute(ss.str().c_str()); ss.str("");

        ss << "ALTER DIMENSION data/vals TO +" << vals.size();
        HDFql::execute(ss.str().c_str()); ss.str("");

        ss << "INSERT INTO data/vals(-" << vals.size() << ":1:1:" << vals.size()
            << ") VALUES FROM MEMORY "
            << HDFql::variableTransientRegister(vals.data());
        HDFql::execute(ss.str().c_str()); ss.str("");

        total_added += vals.size();
        std::cout << i << ": "<<  ss.str() << ":  dim = " << dim
                << " : added = " << vals.size() << " (total="
                << total_added << ")" << std::endl;

    }

    HDFql::execute("CLOSE FILE");
}

This code keeps the size of each write constant at 1024 (num_values = 1024;) and works fine. However, if this is changed to 1025, the bug appears, as evidenced by the console output:

....
235: :  dim = 240875 : added = 1025 (total=241900)
236: :  dim = 241900 : added = 1025 (total=242925)
237: :  dim = 0 : added = 1025 (total=243950)
238: :  dim = 0 : added = 1025 (total=244975)
239: :  dim = 0 : added = 1025 (total=246000)
....

Indicating that something breaks at iteration 237, since the dimension of the dataset is clearly not zero at that point.

Weirdly, this does not explain why I was having the problem in the original example, since the size of each write there was capped at 1000.

Mr Squid

3 Answers


You are using the variable i in both the outer and inner for loops, which is wrong. Also, as a suggestion, the code snippet you posted could be optimized as follows:

  1. No need to create group data as when you create the dataset data/vals, HDFql creates data as a group (if it does not exist) and vals as a dataset.

  2. No need to open and close the file /tmp/test_random.h5 inside the loop (as this has a performance penalty); just open the file at the beginning of your code and close it at the end.

Here is the corrected/refactored code:

#include <cstdint>
#include <random>
#include <sstream>
#include <vector>
#include <HDFql.hpp>

int main (int argc, const char * argv[]) {

    HDFql::execute("CREATE TRUNCATE AND USE FILE /tmp/test_random.h5");

    HDFql::execute("CREATE CHUNKED DATASET data/vals AS SMALLINT(0 TO UNLIMITED)");

    std::stringstream ss;
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> dist_vals(0, 500);
    std::uniform_int_distribution<> dist_len(300, 1000);

    for(int i=0; i<500; i++)
    {
        const int num_values = dist_len(eng);
        std::vector<uint16_t> vals;
        for(int j=0; j<num_values; j++)
        {
            const int value = dist_vals(eng);
            vals.push_back(value);
        }

        ss.str("");
        ss << "ALTER DIMENSION data/vals TO +" << vals.size();
        HDFql::execute(ss.str().c_str());

        ss.str("");
        ss << "INSERT INTO data/vals(-" << vals.size() << ":1:1:" << vals.size()
            << ") VALUES FROM MEMORY "
            << HDFql::variableTransientRegister(vals.data());
        HDFql::execute(ss.str().c_str());

    }

    HDFql::execute("CLOSE FILE");

}
SOG
  • Many thanks for the answer - the index `i` is definitely a typo. However, this code gives me the exact same bug/error - at some places it simply fills the dataset with zeros (see the screenshot https://i.stack.imgur.com/1WlWD.png). Also, the reason I open/close the file on each iteration is to simulate my actual use-case, where I need to dump data into the HDF5 at random intervals. – Mr Squid Sep 11 '19 at 01:48
  • Concerning opening/closing the file, it's probably better to flush (i.e. write) the data instead then by doing `HDFql::execute("FLUSH");` – SOG Sep 11 '19 at 10:20

To reply to your edit above: there are no issues in extending the dimension of the dataset with num_values set to 1025.

Here is the code snippet I used to test this:

#include <cstdio>
#include <iostream>
#include "HDFql.hpp"

int main(int argc, char *argv[])
{

    char script[1024];
    int total_added = 0;
    int num_values = 1025;

    HDFql::execute("CREATE TRUNCATE AND USE FILE /tmp/test_random.h5");

    HDFql::execute("CREATE CHUNKED DATASET data/vals AS SMALLINT(0 TO UNLIMITED)");

    for(int i = 0; i < 5000; i++)
    {
        long long dim = 0;
        sprintf(script, "SHOW DIMENSION data/vals INTO MEMORY %d", HDFql::variableTransientRegister(&dim));
        HDFql::execute(script);

        sprintf(script, "ALTER DIMENSION data/vals TO +%d", num_values);
        HDFql::execute(script);

        total_added += num_values;
        std::cout << i << ": " << ":  dim = " << dim << " : added = " << num_values << " (total=" << total_added << ")" << std::endl;
    }

    HDFql::execute("CLOSE FILE");

}
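A side note on the fixed `char script[1024]` buffer: sprintf will silently overflow it if a command ever grows past 1024 bytes, whereas snprintf truncates and reports the intended length. A standalone sketch of the bounded variant (the helper name is made up for illustration, not part of HDFql):

```cpp
#include <cstdio>
#include <string>

// Format an ALTER DIMENSION command into a bounded stack buffer.
// Returns the full command, or an empty string if it would not fit.
std::string format_alter(int num_values) {
    char script[1024];
    int n = std::snprintf(script, sizeof(script),
                          "ALTER DIMENSION data/vals TO +%d", num_values);
    if (n < 0 || n >= static_cast<int>(sizeof(script))) return "";
    return std::string(script, n);
}
```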
SOG
  • Sure, but in this example you are not actually writing any data into the dataset. The problem here occurs when data is written from a std::vector into the dataset as well. But this was helpful, since it indicates that the issue is not with expanding the dataset. – Mr Squid Sep 12 '19 at 04:36

So I figured out where the problem is. In the following, the first example works and the second does not:

Works

#include <cstdio>
#include <iostream>
#include <random>
#include <vector>
#include <HDFql.hpp>

int main (int argc, const char * argv[]) {
    char script[1024];
    int total_added = 0;
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> dist_vals(0, 450);
    std::uniform_int_distribution<> dist_len(100, 300);
    const int fixed_buffer_size = 10000;

    HDFql::execute("CREATE TRUNCATE FILE /tmp/test_random.h5");
    HDFql::execute("USE FILE /tmp/test_random.h5");
    HDFql::execute("CREATE CHUNKED DATASET data/vals AS INT(0 TO UNLIMITED)");

    for(int i = 0; i < 5000; i++)
    {
        const int num_values = dist_len(eng);
        std::vector<int> vals(fixed_buffer_size);
        long long dim = 0;
        sprintf(script, "SHOW DIMENSION data/vals INTO MEMORY %d", HDFql::variableTransientRegister(&dim));
        HDFql::execute(script);

        sprintf(script, "ALTER DIMENSION data/vals TO +%d", num_values);
        HDFql::execute(script);

        for(int j=0; j<num_values; j++)
        {
            const int value = dist_vals(eng);
            vals.at(j) = value;
        }
        sprintf(script, "INSERT INTO data/vals(-%d:1:1:%d) VALUES FROM MEMORY %d", num_values, num_values, HDFql::variableTransientRegister(vals.data()));
        HDFql::execute(script);
        HDFql::execute("FLUSH");

        total_added += num_values;
        std::cout << i << ": " << ":  dim = " << dim << " : added = " << num_values << " (total=" << total_added << ")" << std::endl;
    }

    HDFql::execute("CLOSE FILE");
}

Fails

#include <cstdio>
#include <iostream>
#include <random>
#include <vector>
#include <HDFql.hpp>

int main (int argc, const char * argv[]) {
    char script[1024];
    int total_added = 0;
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> dist_vals(0, 450);
    std::uniform_int_distribution<> dist_len(100, 300);

    HDFql::execute("CREATE TRUNCATE FILE /tmp/test_random.h5");
    HDFql::execute("USE FILE /tmp/test_random.h5");
    HDFql::execute("CREATE CHUNKED DATASET data/vals AS INT(0 TO UNLIMITED)");

    for(int i = 0; i < 5000; i++)
    {
        const int num_values = dist_len(eng);
        std::vector<int> vals(num_values);
        long long dim = 0;
        sprintf(script, "SHOW DIMENSION data/vals INTO MEMORY %d", HDFql::variableTransientRegister(&dim));
        HDFql::execute(script);

        sprintf(script, "ALTER DIMENSION data/vals TO +%d", num_values);
        HDFql::execute(script);

        for(int j=0; j<num_values; j++)
        {
            const int value = dist_vals(eng);
            vals.at(j) = value;
        }
        sprintf(script, "INSERT INTO data/vals(-%d:1:1:%d) VALUES FROM MEMORY %d", num_values, num_values, HDFql::variableTransientRegister(vals.data()));
        HDFql::execute(script);
        HDFql::execute("FLUSH");

        total_added += num_values;
        std::cout << i << ": " << ":  dim = " << dim << " : added = " << num_values << " (total=" << total_added << ")" << std::endl;
    }

    HDFql::execute("CLOSE FILE");
}

The only difference between the two is that in the first, the data buffer vals has a fixed size, while in the second it is allocated with a random size on each iteration.

I don't understand why this error occurs, since in C++ a std::vector is supposed to keep its underlying data contiguous in memory and be fully compatible with C arrays and pointer arithmetic. But clearly something different is happening in each example. Anyway, I hope this helps anyone else with this issue - the workaround is to use fixed-size data buffers.
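For what it's worth, the contiguity assumption itself checks out in isolation. A minimal standalone check (no HDFql involved; the helper is made up for illustration) that a runtime-sized vector still exposes a contiguous, C-compatible buffer:

```cpp
#include <cstring>
#include <vector>

// Returns true if a std::vector<int> of runtime size n is laid out
// contiguously: element j lives at data() + j, and the buffer
// round-trips through a plain C memcpy like an array would.
bool vector_is_contiguous(int n) {
    std::vector<int> vals(n);
    for (int j = 0; j < n; j++) vals[j] = j;
    for (int j = 0; j < n; j++)
        if (&vals[j] != vals.data() + j) return false;
    std::vector<int> raw(n);
    std::memcpy(raw.data(), vals.data(), n * sizeof(int));
    return raw.front() == 0 && raw.back() == n - 1;
}
```

So whatever breaks in the failing example, it is unlikely to be the vector's memory layout itself.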

Mr Squid