I have read a lot about image encoding techniques, e.g. Bag of Visual Words, VLAD or Fisher Vectors.
However, I have a very basic question: we know that we can match local descriptors directly (by brute force or with ANN techniques). So why don't we just use descriptor matching instead of an encoding?
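To make the direct-matching option concrete, here is a minimal NumPy sketch of brute-force nearest-neighbour matching between two descriptor sets. Random vectors stand in for real SIFT descriptors, and `brute_force_match` is a hypothetical helper name, not a library function.

```python
import numpy as np

def brute_force_match(desc_a, desc_b):
    """For each row of desc_a, return the index of its nearest row in desc_b
    (Euclidean distance) together with that distance."""
    # Squared distances via |a|^2 + |b|^2 - 2 a.b, clipped at 0 against rounding error.
    sq_a = np.sum(desc_a ** 2, axis=1)[:, None]
    sq_b = np.sum(desc_b ** 2, axis=1)[None, :]
    d2 = np.maximum(sq_a + sq_b - 2.0 * desc_a @ desc_b.T, 0.0)
    nearest = np.argmin(d2, axis=1)
    dists = np.sqrt(d2[np.arange(len(desc_a)), nearest])
    return nearest, dists

# Assumption: 1000 descriptors of dimension 128 per image, filled with random values.
rng = np.random.default_rng(0)
desc_a = rng.random((1000, 128), dtype=np.float32)
perm = rng.permutation(1000)
desc_b = desc_a[perm]  # second "image": the same descriptors in shuffled order

nearest, dists = brute_force_match(desc_a, desc_b)
```

Under these assumptions, a single image pair already needs a 1000 x 1000 distance matrix, and that cost is paid again for every image being compared.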
As far as I know, a Bag of Visual Words vector needs hundreds of thousands of dimensions per image to be an accurate representation. If we consider an image with one thousand SIFT descriptors (which is already a considerable number), we have 128 thousand floating-point numbers, which is usually fewer than the number of BoVW dimensions. So memory does not seem to be the reason (at least as long as we are not considering large-scale problems, where VLAD/FV codes are preferred).
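The arithmetic behind this comparison, as a rough sketch. The vocabulary size, the float32 storage, and the dense (non-sparse) BoVW vector are assumptions chosen only for illustration.

```python
# Back-of-the-envelope memory comparison; every size here is an assumption.
n_descriptors = 1000     # SIFT descriptors per image (figure used above)
sift_dim = 128           # dimensionality of one SIFT descriptor
bytes_per_float = 4      # assuming float32 storage
vocab_size = 200_000     # hypothetical BoVW vocabulary size

raw_floats = n_descriptors * sift_dim            # numbers stored for raw descriptors
raw_bytes = raw_floats * bytes_per_float
bovw_dense_bytes = vocab_size * bytes_per_float  # dense histogram, one bin per visual word

print(raw_floats)        # 128000
print(raw_bytes)         # 512000
print(bovw_dense_bytes)  # 800000
```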
Then why do we use such encoding techniques? Is it for performance reasons?