
I am trying to figure out the best data layout for my use case (a research project). This is not my speciality, so while I can articulate what I want and what I think may work, I am trying to steer away from failure paths.

For now, assume that the raw data are similar to several large corpora of text that are split into sequences (e.g. sentences), each of which contains a number of tokens (e.g. words). I extract, process, and save information on a per-sequence, per-token basis, but require different operations on it in subsequent analyses. Specifically, each token in each sequence is associated with a large vector (which can be numerical) that is prepared by a number of operations that are already implemented. Each sequence is associated with some metadata. This operation, and thereby the preparation of this data, occurs only once.

So: the output of the initial operation is a three-dimensional tensor D[x,y,z] plus metadata associated with the x dimension. The x dimension denotes the sequence, y the token position within the sequence (not the unique token-id, e.g. the word encoding, which is part of the sequence metadata), and z the columns (many thousands) of information for that token. So each sequence is associated with a matrix with tokens as rows and information as columns. The metadata can probably be made to fit into the first row if necessary. Note that each sequence has the same length. (A minimal sketch of the shapes follows the example layout below.)

Sequence 1
Meta-data: [..]
         Column 1 | Column 2 | ...
Token 1 |  [...]  |   [...]  | ...    
Token 2 |  [...]  |   [...]  | ...   
...
Token N |  [...]  |   [...]  | ... 

Sequence 2
Meta-data: [..]
         Column 1 | Column 2 | ...
Token 1 |  [...]  |   [...]  | ...    
Token 2 |  [...]  |   [...]  | ...   
...
Token N |  [...]  |   [...]  | ...
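
To make the shapes concrete, here is a minimal numpy sketch; all sizes are placeholders:

```python
import numpy as np

N_SEQ, N_TOKENS, N_COLS = 1_000, 50, 4_096   # placeholder sizes

# D[x, y, z]: sequence x, token position y, information column z
D = np.zeros((N_SEQ, N_TOKENS, N_COLS), dtype="float32")

# per-sequence metadata, e.g. the unique token-id at each position
token_ids = np.zeros((N_SEQ, N_TOKENS), dtype="int64")

seq_matrix = D[3]   # one sequence: tokens as rows, information as columns
```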

This data is ingested multiple times by different subsequent analyses. I therefore require different "views" of this data, as follows:

  1. I need to be able to query each sequence and get the full matrix of token->values. That is simply the output 3D tensor, where I query along the first dimension. It would be nice to be able to "slice" multiple sequences at once (e.g. random batches for ML models etc.)

  2. I would like to be able to query by unique token-id (e.g. the word "hello"), noting that each token may occur in several sequences and at different positions. This is not a query into one dimension of the tensor, but rather requires data that maps unique token-ids to their positions in the sequences (or metadata within each sequence allowing such a query). (See the sketch after this list.)

  3. I finally generate and save further summary values for each token per sequence, which I want to query extremely quickly, without needing any other information from that sequence.
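
For view 2, I picture something like the following inverted index, built once from the per-sequence token-ids (all names and sizes are placeholders):

```python
from collections import defaultdict

import numpy as np

# hypothetical metadata: token_ids[x, y] is the unique token-id at
# position y of sequence x
token_ids = np.random.randint(0, 30_000, size=(1_000, 50))

# unique token-id -> list of (sequence, position) occurrences
index = defaultdict(list)
for seq_id, row in enumerate(token_ids):
    for pos, tid in enumerate(row):
        index[int(tid)].append((seq_id, pos))

# all occurrences of a given token-id, usable to gather rows D[x, y, :]
occurrences = index.get(1234, [])
```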

What all subsequent modeling has in common is

  • I need as much RAM as possible for the subsequent analyses; in other words, the data may or may not need to be pushed to disk. That is why I am looking for a solution that allows both in-memory and out-of-memory access. In particular, the whole tensor may not fit into memory at all (it is built up incrementally along the x dimension).

  • Given the fixed structure, indexing and slicing is relatively straightforward, but I may often need to select non-adjacent entries, such as tokens from unrelated sequences.

  • The whole thing should not bottleneck the subsequent analyses. It would also be beneficial if it were somewhat portable and did not require additional software, so that the results can be distributed and reproduced easily by other researchers. In fact, I would like to make this data available for download if that turns out to be possible (legally).

  • Since this is an input, I am primarily interested in the speed of accessing these data from Python or other languages.

Based on this, I have tentatively settled on using either h5py or pyTables, but I am open to other options.

While the data is large, it is not so large that disk space is an issue (on a moderately sized server). I further iterate over each sequence at least once to perform the initial operations. I therefore plan to save each required "view" into separate datasets, each laid out to enable efficient access.

My plan is as follows:

  1. I save the output tensor as a multi-dimensional array in pyTables. The index dimension is going to be the sequence number. I may query several sequences, but always ingest the 2D table of a whole sequence. My hope is that pyTables allows me to keep the whole 3D tensor on disk, and only read the required data into RAM.

  2. I will save a new dataset that has the unique token-id as index, the sequence-id as a second column, and then the required information as an array. This way, I can query by token-id and get all associated data across all sequences. This includes a lot of duplication, but should allow for fast querying (?)

  3. I will finally make a smaller dataset with the associated summary data for each token-id (as index) in each sequence. (A sketch of all three datasets follows this list.)
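
A minimal sketch of how I imagine the three datasets in pyTables (all names, sizes, and dtypes are placeholders, and I have not benchmarked this):

```python
import numpy as np
import tables as tb

N_TOKENS, N_COLS = 50, 4_096   # placeholder sizes

with tb.open_file("views.h5", mode="w") as f:
    # 1. the 3D tensor, extendable along the sequence (x) dimension
    tensor = f.create_earray(f.root, "tensor", atom=tb.Float32Atom(),
                             shape=(0, N_TOKENS, N_COLS), expectedrows=100_000)

    # 2. one row per (token-id, sequence, position), duplicating the vector
    class TokenRow(tb.IsDescription):
        token_id = tb.Int64Col()
        seq_id = tb.Int64Col()
        position = tb.Int32Col()
        values = tb.Float32Col(shape=(N_COLS,))

    by_token = f.create_table(f.root, "by_token", TokenRow)

    # 3. small per-(token, sequence) summary values
    class SummaryRow(tb.IsDescription):
        token_id = tb.Int64Col()
        seq_id = tb.Int64Col()
        summary = tb.Float32Col()

    summaries = f.create_table(f.root, "summaries", SummaryRow)

    # appending one (dummy) sequence to views 1 and 2
    seq_matrix = np.zeros((N_TOKENS, N_COLS), dtype="float32")
    token_ids = np.random.randint(0, 30_000, size=N_TOKENS)
    tensor.append(seq_matrix[np.newaxis])
    row = by_token.row
    for pos, tid in enumerate(token_ids):
        row["token_id"], row["seq_id"], row["position"] = int(tid), 0, pos
        row["values"] = seq_matrix[pos]
        row.append()
    by_token.flush()

    # index the token-id column so read_where() is fast
    by_token.cols.token_id.create_index()
    hits = by_token.read_where("token_id == 1234")
```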

Do you think that would be efficient in terms of computation time?

The other route I see would be a relational database, such as SQL. Here, I could simply make one entry for each actual word in a sequence, with the associated token-id, sequence number, and the data I need. An SQL query could then be used to get the data in any way I choose. Further, any metadata could be saved in other tables, either by sequence or by token, without many restrictions.
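
If I went this route, I imagine something like the following SQLite schema (table and column names are placeholders), storing the per-token vector as a binary blob:

```python
import sqlite3

import numpy as np

con = sqlite3.connect("corpus.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS sequences (
    seq_id   INTEGER PRIMARY KEY,
    metadata TEXT
);
CREATE TABLE IF NOT EXISTS tokens (
    seq_id   INTEGER REFERENCES sequences(seq_id),
    position INTEGER,
    token_id INTEGER,
    vector   BLOB,              -- raw float32 bytes of the z dimension
    PRIMARY KEY (seq_id, position)
);
CREATE INDEX IF NOT EXISTS idx_tokens_token_id ON tokens(token_id);
""")

# query by unique token-id across all sequences
rows = con.execute(
    "SELECT seq_id, position, vector FROM tokens WHERE token_id = ?", (1234,)
).fetchall()
vectors = [np.frombuffer(blob, dtype=np.float32) for _, _, blob in rows]
```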

However, I am not sure whether that is the fastest option, since I do not require many of the things SQL provides, such as additional flexibility (my queries / views are fixed, and indexing/slicing is always along a fixed dimension) or all the access protections and so on. Plus, portability is better if it's just some dataset files.

I am also not sure how SQL handles in-memory and out-of-memory issues. There may be instances where large parts of my data actually fit in RAM, so I want flexibility there as well.

Questions:

  • What is your sense of the best approach? Is my plan sound?

  • SQL seems clearly more flexible; is it perhaps even faster?

  • What I do not yet understand about HDF5 is how chunking and groups play into this. It seems I cannot really chunk my data, because I need to be able to query non-contiguous data with high frequency. Is it correct that, for my use case, I should not chunk? (My current mental model is shown in the sketch after these questions.)

  • Similarly, groups and links. My data structure does not resemble a tree, because each token may occur in many sequences, which is why I chose to just produce different datasets entirely. Would it be more efficient to try to use hard links or groups?

  • How would the memory model of HDF5 work (as implemented in python)? Is it true that I can query, say, the 3D tensor, and only keep the results in memory, but also have a cache for sequences or tokens that are frequently queried?
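
For concreteness, here is roughly what I understand chunking to mean in h5py, with the chunk shape matched to the per-sequence access pattern; please correct me if this mental model is wrong (names and sizes are placeholders):

```python
import h5py
import numpy as np

N_TOKENS, N_COLS = 50, 4_096   # placeholder sizes

with h5py.File("tensor.h5", "w") as f:
    # one chunk per sequence: reading D[i] then touches exactly one chunk on disk
    dset = f.create_dataset("tensor",
                            shape=(0, N_TOKENS, N_COLS),
                            maxshape=(None, N_TOKENS, N_COLS),
                            chunks=(1, N_TOKENS, N_COLS),
                            dtype="float32")
    for _ in range(100):               # append sequences as they are produced
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = np.zeros((N_TOKENS, N_COLS), dtype="float32")

with h5py.File("tensor.h5", "r") as f:
    # fancy-index non-adjacent sequences (indices must be increasing in h5py)
    batch = f["tensor"][[3, 17, 42], :, :]
```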

If my description is not clear, please let me know. Thank you for taking the time to read all this.

    `h5py` is the interface between `HDF5` and `numpy` arrays. While it gives good access to multidimensional arrays, including the usual slicing, it is better for numeric values than for text (`numpy` unicode string dtypes). I haven't used Pytables much, but it's more oriented to `pandas` use. – hpaulj Oct 19 '19 at 01:03
  • All values will be encoded as numerical, so I think that is quite alright – IMA Oct 21 '19 at 08:59

1 Answer


For anyone coming across this question, let me give you the result.

The above works as intended using pyTables. It can be made reasonably fast. However, the logic rapidly produces files of humorously gigantic proportions, so I can only recommend finding a different way. In particular, disk space turned out to be more problematic than RAM usage, especially if things can be sparsified.

A custom solution to subset the data into memory was more successful than using pyTables chunking. So in effect, in all but knife-edge cases, the above is probably not a good idea.
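
For illustration only, the custom subsetting amounted to something along these lines (a flat memory-mapped file plus explicit copies of the requested sequences into RAM); the details in the actual project differ:

```python
import numpy as np

N_TOKENS, N_COLS = 50, 4_096   # placeholder sizes

# the tensor stored as one flat float32 file; the shape comes from a small sidecar
data = np.memmap("tensor.f32", dtype="float32", mode="r")
data = data.reshape(-1, N_TOKENS, N_COLS)

def load_sequences(indices):
    """Copy only the requested (possibly non-adjacent) sequences into RAM."""
    return np.stack([np.array(data[i]) for i in indices])

batch = load_sequences([3, 17, 42])
```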
