Setup: I need to store feature vectors associated with string-string pairs. The string-string pairs encode an input-output relationship. There will be a relatively small number of inputs X (e.g. 5), and for each input x, there will be a relatively small number of outputs Y|x (e.g. 10).

The question is, what data structure is fastest?

Additional relevant information:

  1. The outputs are generally different for each input, and it cannot be assumed that each input has the same number of outputs.
  2. Lookup will be done "many" times (perhaps 1000).
  3. Inputs will be sampled equally frequently, but for each input, usually one or two outputs will be accessed frequently, and the remainder will be accessed infrequently or not at all.

At present, I am considering three possibilities:

  1. list-of-lists: access outer list with index (representing input X[i]), access inner list with index (representing output Y[i][j]).
  2. hash-of-hashes: same as above, but keyed by the input and output strings instead of indices.
  3. flat hash: key = (input,output).
  • I'm not sure I understand you. If `X` and `Y` are observable (possibly correlated) random variables in a modeling problem, then a feature vector would be a pair `[x, y]` of specific values of `X` and `Y`. I don't think this is what you mean. What do you want to keep in this data structure? – phs Mar 09 '13 at 22:53
  • I already know the X and Y values (strings). I referred to them as random values because they will be accessed according to a partially unknown probability distribution. Specifically, the stored values will be referenced by a conjugate gradient descent algorithm. – NaturalLinguist Mar 10 '13 at 02:42

1 Answer

If you have strings, it's unclear how you would look up the index for a list of lists efficiently without hashing anyway. If you can pass around something that holds the index instead of the string (e.g. if the set of outputs is fixed and you can define an enumeration of them), a list of lists would be faster (assuming you mean "list" in the "not necessarily a linked list" sense, with O(1) element access). Otherwise you may as well just hash directly and save yourself the effort.
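To make the enumeration idea concrete, here is a minimal Python sketch. The strings, the feature values, and all variable names are made up for illustration; the only point is that string hashing happens once, up front, and the hot loop then uses plain integer indexing.

```python
# Hypothetical data: each input string maps to its own list of output strings.
outputs_by_input = {
    "cat": ["chat", "gato"],
    "dog": ["chien", "perro", "hund"],
}

# Enumerate once: every string gets an integer index (this step still hashes).
inputs = list(outputs_by_input)
input_index = {x: i for i, x in enumerate(inputs)}
output_index = [
    {y: j for j, y in enumerate(outputs_by_input[x])} for x in inputs
]

# List of lists of feature vectors; inner lists may differ in length.
# The feature values here are placeholders (string lengths).
features = [
    [[float(len(x)), float(len(y))] for y in outputs_by_input[x]]
    for x in inputs
]

# Resolve strings to indices once, outside the hot loop...
i = input_index["dog"]
j = output_index[i]["perro"]

# ...then each access is O(1) list indexing with no string hashing.
vec = features[i][j]
```

If the caller only ever has the strings in hand at lookup time, the two dictionary lookups above are unavoidable and this layout buys nothing over hashing directly.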

If not, that leaves hash of hashes vs. flat hash. What's your access pattern like? Are you always going to ask for a specific (X, Y) pair, or would you ever need to access all outputs for an X? Hash(X+Y) is likely roughly equivalent in cost to hash(X) + hash(Y), since both generally walk over all the characters to generate the hash. So individual hashes are more flexible, at a slight (almost certainly negligible) overhead. From 3, it sounds like you might need the hash of hashes, anyhow.
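As a rough illustration of the flexibility trade-off, the sketch below builds both layouts in Python from the same made-up data (the strings and vectors are placeholders, not from the question). Single-pair lookup works either way; fetching all outputs for one input is a single lookup in the nested layout but a scan over every key in the flat one.

```python
# Hypothetical feature vectors keyed by input string, then output string.
nested = {
    "cat": {"chat": [1.0, 0.0], "gato": [0.5, 0.5]},
    "dog": {"chien": [0.0, 1.0]},
}

# The same data as a flat hash keyed by (input, output) tuples.
flat = {(x, y): vec for x, ys in nested.items() for y, vec in ys.items()}

# Single-pair access: two hash lookups vs. one lookup on a tuple key.
pair_nested = nested["cat"]["gato"]
pair_flat = flat[("cat", "gato")]

# All outputs for one input: one lookup in the nested layout...
outputs_for_cat = nested["cat"]

# ...but a scan over all keys in the flat layout.
outputs_for_cat_flat = {y: v for (x, y), v in flat.items() if x == "cat"}
```

With roughly 5 inputs and 10 outputs each, either layout is tiny, so it is probably worth timing both on real data before committing to one.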

James
  • Thanks for your insight! I can define an enumeration of the strings, and for one sub-part of the calculation it is convenient to access all and only the outputs for a given input. For now, I'm just going with the hash-of-hashes structure, directly from the strings. If that turns out to be unacceptably slow, I will try the alternatives, and post again if one of them is dramatically faster. – NaturalLinguist Mar 10 '13 at 04:32