I have a very large Weaviate class (700,000 records) to which I pass my own custom vectors. I’m trying to get distances against a query vector, as shown below. The query vector is identical to the vector of one of the stored records, so I know the top hit should be that record, with a distance very close to 0. However, when I ask for the top hits, the “closest” record comes back with a distance of around 0.10, and it is definitely not the record that matches my query vector (it has node_type="type1" instead of "type2").
# NOTE: mean_emb is a numpy array that matches a record pushed to the MyClass weaviate class.
# In theory this should return the distance from every one of the 700k records to the given vector,
# since "distance" = 1.0, though I understand why that would be computationally impractical
result = (
    client.query.get("MyClass", ["message", "node_type", "my_id", "timestamp"])
    .with_near_vector({"vector": mean_emb.tolist(), "distance": 1.0})
    .with_additional(["vector", "distance"])
    .do()
)
result = result["data"]["Get"]["MyClass"]
print(len(result)) # only 11,100 distances are returned
It looks like with_offset() doesn’t like it when the offset is >100,000. I have tried pagination using with_after(), but with_after() doesn’t support queries with with_near_vector(), and I have also tried with_offset() + with_limit(), but this is terribly slow. Is there a workaround, or what am I doing wrong here? How should I query my class so that my top N query includes the true record match (distance close to 0)?
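For concreteness, here is a minimal sketch of the with_offset() + with_limit() paging pattern mentioned above. It assumes the v3 Python client's query builder (the same chained calls used in the snippets in this post). The helper names fetch_page and iter_hits, the page size, and the max_hits cap are hypothetical and only there for illustration. Each page re-runs the vector search, which is likely why this approach gets slow at large offsets, and the server may also cap how far offset + limit can reach.

```python
def fetch_page(client, vector, limit, offset, class_name="MyClass",
               properties=("message", "node_type", "my_id", "timestamp")):
    """Return one page of near-vector hits, nearest first.

    `vector` should be a plain list of floats, e.g. mean_emb.tolist().
    """
    result = (
        client.query.get(class_name, list(properties))
        .with_near_vector({"vector": vector})
        .with_limit(limit)
        .with_offset(offset)
        .with_additional(["distance", "id"])
        .do()
    )
    # GraphQL-level problems come back in the response body, not as exceptions.
    if "errors" in result:
        raise RuntimeError(result["errors"])
    return result["data"]["Get"][class_name]


def iter_hits(client, vector, page_size=1000, max_hits=10000):
    """Yield hits page by page until the results run out or max_hits is reached."""
    offset = 0
    while offset < max_hits:
        page = fetch_page(client, vector, min(page_size, max_hits - offset), offset)
        if not page:
            break
        yield from page
        offset += len(page)
```

Usage would look like `hits = list(iter_hits(client, mean_emb.tolist()))`. This only illustrates the pattern; it does not by itself explain why the exact match is missing from the first page.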
To prove there is in fact a record with distance ~0.000, here’s a query that singles out the record that matches the vector:
where_filter = {"path": ["node_type"], "operator": "Equal", "valueText": "type2"}
result = (
    client.query.get("MyClass", ["message", "node_type", "my_id", "timestamp"])
    .with_near_vector({"vector": mean_emb.tolist()})
    .with_additional(["distance", "id"])
    .with_where(where_filter)
    .do()
)
print(result)
This gives me the following (I’ve changed some of the record metadata values to protect the data):
{'data': {'Get': {'MyClass': [{'_additional': {'distance': -1.9073486e-06,
'id': 'fdb00f95-2c07-462c-84cd-9380c6777801'},
'my_id': 'Record that matches the vector passed',
'message': None,
'node_type': 'type2',
'timestamp': None},
{'_additional': {'distance': 0.6122676,
'id': '0deb152a-eef0-485c-ad6e-c9e29f9a3915'},
'my_id': "Another type2 record that doesn't match vector passed",
'message': None,
'node_type': 'type2',
'timestamp': None}]}}}
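As a sanity check on the tiny negative distance above (-1.9073486e-06), here is a small NumPy sketch. It assumes the class uses cosine distance (1 minus cosine similarity), which this post does not actually confirm, and it uses random 768-dimensional float32 vectors as stand-ins for mean_emb (the dimension and the data are hypothetical). The point is only that a vector compared with itself in float32 gives a distance that is zero up to rounding error, so a value on the order of 1e-06 with either sign is consistent with an exact match.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), computed in float32."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - similarity)


# Hypothetical stand-ins for the real embeddings.
rng = np.random.default_rng(0)
vec = rng.normal(size=768).astype(np.float32)
other = rng.normal(size=768).astype(np.float32)

d_self = cosine_distance(vec, vec)     # identical vectors: zero up to rounding
d_other = cosine_distance(vec, other)  # unrelated vectors: far from zero

print(f"self: {d_self:.3e}, other: {d_other:.3f}")
```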