
I am working on an image search engine. I use an algorithm to extract features from images, which produces a 560-dimensional float vector. This dimensionality seems too high for Elasticsearch to index; I expect ES to be very slow. So I want to reduce the dimensionality of the feature vector. One approach I am considering is to hash groups of consecutive numbers, for example hashing 20 numbers in a row into a single hash value, which brings the dimensionality down to 28 and is reasonable for ES. The problem is that I cannot find any theory to support this approach. Is there a proven solution to this problem? A sketch of what I mean is below. Thanks in advance.
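To make the idea concrete, here is roughly what I have in mind (just a sketch; the `hash_reduce` name, the group size of 20, and the use of Python's built-in `hash` are only for illustration):

```python
import numpy as np

def hash_reduce(vector, group_size=20):
    """Collapse each group of `group_size` consecutive floats into one hash value.
    With a 560-dimensional input and group_size=20, the output has 28 dimensions."""
    assert len(vector) % group_size == 0
    reduced = []
    for i in range(0, len(vector), group_size):
        group = tuple(vector[i:i + group_size])
        reduced.append(hash(group))
    return np.array(reduced)

features = np.random.rand(560).astype(np.float32)  # placeholder for real image features
print(hash_reduce(features).shape)  # (28,)
```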

David
    Why do you think that ES will be too slow? Have you done some testing and witnessed issues? It is not uncommon for ES documents to contain hundreds of fields. The default limit is at 1000 fields per document, but I've seen cases with 3K fields per document and ES was humming along just fine. – Val Sep 28 '17 at 05:58
  • @Val I just found this image-match project (https://github.com/ascribe/image-match/tree/master/image_match), and its ES array has 648 elements, so I think my original understanding was wrong. Thanks for your comment. – David Sep 28 '17 at 07:56

1 Answer


560 dimensions is very high. A common size is 128, and it can most likely be reduced even further using PCA.

You can use PCA (Principal Component Analysis) to reduce the dimensionality. Basically it's similar to compression: of course, you will lose some accuracy.

See https://en.wikipedia.org/wiki/Principal_component_analysis
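For example, with scikit-learn (a minimal sketch; it assumes your feature vectors are stacked as rows of a NumPy array, and the target of 128 components is just the conventional size mentioned above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Assume `features` is an (n_images, 560) array of extracted feature vectors.
features = np.random.rand(1000, 560).astype(np.float32)  # placeholder data

# Fit PCA once on a representative sample of your images, then reuse it.
pca = PCA(n_components=128)
reduced = pca.fit_transform(features)        # shape: (1000, 128)

print(reduced.shape)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained

# New query vectors must be projected with the same fitted model:
query = np.random.rand(1, 560).astype(np.float32)
query_reduced = pca.transform(query)         # shape: (1, 128)
```

The explained variance ratio gives you a rough measure of how much information you keep, so you can tune `n_components` against the accuracy you are willing to trade for indexing speed.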

Ethan Allen