First, you don't need to train your own neural network to extract features. You can get pretty far with a standard CNN that was trained on ImageNet or a similar large, generic image dataset. These models learn internal representations that are useful for many different downstream visual tasks.
You can essentially chop the top off an ImageNet-trained classifier (the part that does the actual classification into the dataset's categories) and use the hidden activations from a prior layer as features for your image retrieval task.
Here is a PyTorch example:
import torch
from torch import nn
from torchvision.models import resnet34, ResNet34_Weights

class FeatureExtractor(nn.Module):
    def __init__(self, statedict_path=None):
        super().__init__()
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        # get backbone
        self.backbone = resnet34(weights=ResNet34_Weights.IMAGENET1K_V1)
        self.embedding = torch.empty(0)
        # capture the avgpool output: the 512-d vector right before the classifier head
        self.backbone.avgpool.register_forward_hook(self.get_activation())
        self.backbone.eval()
        # optionally load fine-tuned parameters instead of the vanilla ImageNet ones
        if statedict_path is not None:
            loading_result = self.load_state_dict(torch.load(statedict_path))
            print(loading_result)
        self.to(self.device)
        self.transforms = ResNet34_Weights.IMAGENET1K_V1.transforms()

    def get_activation(self):
        def fn(_model, _input, output):
            self.embedding = torch.squeeze(output)
        return fn

    def forward(self, x):
        with torch.no_grad():
            # prepare sample or batch
            if not isinstance(x, torch.Tensor):
                x = self.transforms(x)
            if x.ndim == 3:
                x = x.unsqueeze(0)
            # inference time!
            x = x.to(self.device)
            _ = self.backbone(x)  # note: we are not using the output (ImageNet logits)
            return self.embedding  # instead we return the embedding captured by the forward hook
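Usage might look like this (a minimal sketch; the image path is a placeholder):

from PIL import Image

extractor = FeatureExtractor()                 # vanilla ImageNet weights
img = Image.open("query.jpg").convert("RGB")   # placeholder path
emb = extractor(img)
print(emb.shape)                               # torch.Size([512]) for ResNet34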
This uses a standard, pretrained torchvision model (pass a state-dict path only if you have fine-tuned weights to load). Now of course you may try to fine-tune this model to your particular task and image distribution, but that is significantly more difficult, and you need a labeled dataset.
It's worth noting that simple centering and normalization of the embeddings may significantly boost retrieval performance, without any training! See: https://arxiv.org/abs/1911.04623
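In code, that can be as simple as the following sketch (here embs is assumed to be an (N, D) tensor of database embeddings and query_emb a single query vector; both names are mine, not from the paper):

import torch.nn.functional as F

# center on the dataset mean, then L2-normalize
mean = embs.mean(dim=0, keepdim=True)
db = F.normalize(embs - mean, dim=1)       # centered, unit-length database vectors
q = F.normalize(query_emb - mean, dim=0)   # apply the same transform to the query
scores = db @ q                            # cosine-similarity ranking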
Also, I heard some time ago (and can confirm from my own experience) that older models, in particular VGG, have 'richer' internal representations. If you are not planning to fine-tune the model, they may produce better results. A sketch of that swap follows below.
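If you want to try it, here is a rough sketch using torchvision's VGG16 (the 4096-d 'fc7' activations are a common choice for retrieval; img is again assumed to be a PIL image):

import torch
from torchvision.models import vgg16, VGG16_Weights

weights = VGG16_Weights.IMAGENET1K_V1
vgg = vgg16(weights=weights).eval()
preprocess = weights.transforms()

with torch.no_grad():
    x = preprocess(img).unsqueeze(0)                       # (1, 3, 224, 224)
    feats = torch.flatten(vgg.avgpool(vgg.features(x)), 1) # (1, 25088) conv features
    emb = vgg.classifier[:4](feats)                        # (1, 4096) 'fc7' embedding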
As for the second part of your question, I have limited experience here, but Qdrant seems quite good. If memory is a concern but you are OK with longer query times, you can store all vectors on disk (see the on-disk/memmap storage option in the Qdrant docs). Vector quantization is also possible; in practical cases it seems you don't even need to sacrifice much accuracy for it.
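For illustration, setting that up with the qdrant-client Python package might look roughly like this (a sketch, not a definitive recipe: the collection name and host are placeholders, and the quantization API may differ between client versions):

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
)

client = QdrantClient(url="http://localhost:6333")  # placeholder host

client.create_collection(
    collection_name="image_embeddings",             # placeholder name
    vectors_config=VectorParams(
        size=512,                 # ResNet34 embedding size from the extractor above
        distance=Distance.COSINE,
        on_disk=True,             # keep original vectors on disk to save RAM
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True),
    ),
)

# index one embedding and run a nearest-neighbour query
client.upsert(
    collection_name="image_embeddings",
    points=[PointStruct(id=0, vector=emb.cpu().tolist(), payload={"path": "query.jpg"})],
)
hits = client.search(collection_name="image_embeddings", query_vector=emb.cpu().tolist(), limit=5)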