Why should we use bag of visual words (or vlad) instead of storing descriptors?

Question

I have read a lot about image encoding techniques, e.g. Bag of Visual Words, VLAD or Fisher Vectors.

However, I have a very basic question. We know that we can match descriptors directly (by brute force or with ANN techniques), so why don't we just do that?

As far as I know, a Bag of Visual Words representation needs hundreds of thousands of dimensions per image to be accurate. If we consider an image with 1,000 SIFT descriptors (which is already a considerable number), we have 128,000 floating-point numbers, which is usually fewer than the number of BoVW dimensions. So the reason cannot be memory (at least when we are not considering large-scale problems, where VLAD/FV codes are preferred).

Then why do we use such encoding techniques? Is it for performance reasons?

score 1 · Answer 1 · edited May 28 '18 at 10:43

I had a hard time understanding your question.

Concerning descriptor matching: both brute-force and ANN matching are used in retrieval systems. Common ANN techniques include k-d trees, hashing, etc.
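To make the cost of direct matching concrete, here is a minimal NumPy sketch of brute-force nearest-neighbour matching between two descriptor sets. The function name `nearest_neighbors` and the random stand-in descriptors are assumptions for illustration only; a real system would use actual SIFT descriptors and would usually add a filtering step on top.

```python
import numpy as np

def nearest_neighbors(query, reference):
    """For each query descriptor, return the index of the closest reference descriptor."""
    # Squared Euclidean distances via |q|^2 - 2 q.r + |r|^2, shape (n_query, n_reference)
    d2 = (
        (query ** 2).sum(axis=1)[:, None]
        - 2.0 * query @ reference.T
        + (reference ** 2).sum(axis=1)[None, :]
    )
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
reference = rng.random((1000, 128))   # stand-in for the 1,000 SIFT descriptors of one image
query = reference[:50] + rng.normal(scale=0.01, size=(50, 128))  # noisy copies of the first 50

matches = nearest_neighbors(query, reference)

# Every query descriptor is compared with every reference descriptor.
distances_per_image_pair = query.shape[0] * reference.shape[0]
```

The number of distance evaluations grows with the product of the two descriptor counts, and this has to be repeated for every database image, which is what ANN structures try to reduce.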

BoVW is a traditional representation scheme. At one time, BoVW combined with an inverted index was the state of the art in information retrieval systems. But the dimensionality of the BoVW representation (up to millions), and hence the memory usage per image, limits the number of images that can be indexed in practice.
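As a rough illustration of BoVW with an inverted index, the sketch below quantizes descriptors against a codebook and records which visual words occur in an image. The names `bovw_histogram` and `inverted_index`, the 2,000-word random codebook, and the random descriptors are all assumptions made for the example; in practice the codebook is learned (typically with k-means) and is much larger.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word (nearest codebook entry)."""
    # Squared Euclidean distances, shape (n_descriptors, n_words)
    d2 = (
        (descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * descriptors @ vocabulary.T
        + (vocabulary ** 2).sum(axis=1)[None, :]
    )
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary))

rng = np.random.default_rng(0)
vocabulary = rng.random((2000, 128))    # hypothetical codebook, normally learned from data
descriptors = rng.random((1000, 128))   # stand-in for one image's SIFT descriptors

hist = bovw_histogram(descriptors, vocabulary)
nonzero_words = np.flatnonzero(hist)    # the histogram can have at most 1,000 non-zero bins

# Inverted index: visual word -> list of image ids containing that word
inverted_index = {}
image_id = 0
for w in nonzero_words:
    inverted_index.setdefault(int(w), []).append(image_id)
```

Note that the histogram is sparse: an image with 1,000 descriptors touches at most 1,000 words regardless of the vocabulary size, which is what makes the inverted index workable in the first place.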

FV and VLAD are both compact visual representations with high discriminative ability, something BoVW lacks. VLAD is known to be extremely compact (about 32 KB per image), very discriminative, and efficient in retrieval and classification tasks.
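For intuition, here is a minimal sketch of VLAD encoding, assuming a codebook of 64 centroids over 128-dimensional descriptors stored as 4-byte floats; that choice is mine for the example and happens to give 64 × 128 × 4 = 32,768 bytes. The function name `vlad_encode` and the random inputs are hypothetical, and the sketch uses only a global L2 normalization, leaving out any further compression.

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """Aggregate local descriptors (n, d) into a single vector of length k * d."""
    k, d = centroids.shape
    # Squared distance from every descriptor to every centroid, shape (n, k)
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    vlad = np.zeros((k, d), dtype=np.float32)
    for i in range(k):
        assigned = descriptors[nearest == i]
        if len(assigned):
            # Sum of residuals between the descriptors and their assigned centroid
            vlad[i] = (assigned - centroids[i]).sum(axis=0)
    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

rng = np.random.default_rng(0)
descriptors = rng.random((1000, 128), dtype=np.float32)  # stand-in for SIFT descriptors
centroids = rng.random((64, 128), dtype=np.float32)      # hypothetical learned codebook

code = vlad_encode(descriptors, centroids)
print(code.shape, code.nbytes)
```

The output size depends only on the codebook size and descriptor dimension, not on how many descriptors the image has, so every image gets a fixed-length vector that can be compared with a single distance computation.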

So yes, such encoding techniques are used for performance reasons. You may check this paper for a deeper understanding: Aggregating local descriptors into a compact image representation.
