3

I am using the Mahout API for a Naive Bayes classifier. One of the functions is SparseVectorsFromSequenceFiles, and although I have tried the old Google search, I still do not understand what a sparse vector is. The closest thing to an explanation I have found is this site, which, to be honest, didn't help me understand.

Ben Davison
  • I just needed it explaining in a slightly different way. Sometimes helps to have someone explain it in a different way, and @dasblinkenlight did just that. – Ben Davison Aug 29 '15 at 15:33

1 Answer

4

Conceptually, vectors represent a generalization of arrays, i.e. data structures that allow arbitrary access to their elements using an index. Java's built-in arrays, Vector<T> and ArrayList<T> are examples of data structures implementing a "regular" (dense) vector concept.

Dense vectors provide constant-time access to their elements by translating a vector index into a memory address using a simple formula: baseAddress + index * elementSize. This means that the size in memory is proportional to the largest index that the vector needs to support.

This is acceptable when the number of elements that you wish to put in a vector and the highest possible index are relatively close to each other. However, if you wish to use indexes from a wide range to index a relatively small number of elements (say, 1,000 elements scattered across a vector with 100,000 indexes), allocating 100,000 spaces is wasteful. You can save memory at the expense of CPU cycles by implementing a data structure that exposes the interface of a vector, but uses a smaller amount of memory for its internal representation.
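To make that trade-off concrete, here is a minimal sketch of such a structure. It assumes a vector of doubles in which any index that was never set reads as 0.0, and it stores only the non-zero entries in a hash map. The class name HashSparseVector and its methods are made up for illustration; this is not Mahout's API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sparse vector of doubles (illustrative only, not Mahout's API). */
class HashSparseVector {
    private final int size;                                        // logical length of the vector
    private final Map<Integer, Double> entries = new HashMap<>();  // holds non-zero entries only

    HashSparseVector(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        this.size = size;
    }

    /** Returns the stored value, or 0.0 for an index that was never set. */
    double get(int index) {
        checkIndex(index);
        return entries.getOrDefault(index, 0.0);
    }

    /** Stores a value; writing 0.0 removes the entry so memory tracks non-zero elements. */
    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            entries.remove(index);
        } else {
            entries.put(index, value);
        }
    }

    /** Logical length, i.e. the number of valid indexes. */
    int size() {
        return size;
    }

    /** Number of entries actually held in memory. */
    int storedEntries() {
        return entries.size();
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + " out of range for size " + size);
        }
    }
}
```

A vector created with new HashSparseVector(100000) that has 1,000 values set holds 1,000 map entries rather than 100,000 array slots. The price is that each get and set goes through hashing and boxing instead of a single address computation.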

The example at your link shows one possible implementation. Other implementations are possible, depending on the distribution of indexes in your data. If the indexes are distributed randomly, you could use a HashMap<Integer,T> as the backing storage for a sparse vector. If the indexes are clustered together, you could split your index space into "pages", and allocate a real array only for the pages that you need. This implementation would be similar to the way physical memory is allocated to a virtual address space.
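The paged variant could be sketched roughly as follows. This assumes a vector of doubles with 0.0 as the default value and a fixed page size of 1,024 elements. The class name PagedSparseVector and the page size are illustrative choices, not taken from Mahout or from the linked example.

```java
/** Hypothetical paged sparse vector of doubles (illustrative only). */
class PagedSparseVector {
    private static final int PAGE_SIZE = 1024;  // assumed page size; tune to how tightly indexes cluster

    private final int size;          // logical length of the vector
    private final double[][] pages;  // pages[p] stays null until something is written to page p

    PagedSparseVector(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        this.size = size;
        // Round up so the last, possibly partial, page is covered.
        int pageCount = (int) (((long) size + PAGE_SIZE - 1) / PAGE_SIZE);
        this.pages = new double[pageCount][];
    }

    /** Returns the stored value, or 0.0 if the page holding this index was never allocated. */
    double get(int index) {
        checkIndex(index);
        double[] page = pages[index / PAGE_SIZE];
        return page == null ? 0.0 : page[index % PAGE_SIZE];
    }

    /** Stores a value, allocating the page on first non-zero write. */
    void set(int index, double value) {
        checkIndex(index);
        int p = index / PAGE_SIZE;
        if (pages[p] == null) {
            if (value == 0.0) {
                return;  // writing the default to an unallocated page needs no storage
            }
            pages[p] = new double[PAGE_SIZE];
        }
        pages[p][index % PAGE_SIZE] = value;
    }

    /** Number of pages that currently have a real array behind them. */
    int allocatedPages() {
        int count = 0;
        for (double[] page : pages) {
            if (page != null) {
                count++;
            }
        }
        return count;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + " out of range for size " + size);
        }
    }
}
```

Access stays close to the cost of a dense array (one extra array lookup and a null check), and memory grows only with the number of pages touched. If the indexes are scattered rather than clustered, nearly every page ends up allocated and the saving disappears, which is when the hash-map approach is the better fit.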

Sergey Kalinichenko