I am using Weaviate with the vectorizer semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-mpnet-base-v2, which is supposedly the best multilingual vectorizer available. I am using it to sort through German text:
import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

const schemaConfig = {
  class: 'Person',
  properties: [
    {
      name: 'name',
      dataType: ['string'],
    },
  ],
};

// Drop the class if it already exists so each run starts clean
try {
  await client.schema.classDeleter().withClassName('Person').do();
} catch (e) {
  console.log(e);
}

await client.schema
  .classCreator()
  .withClass(schemaConfig)
  .do();

const schemaRes = await client.schema.getter().do();

await client.data.creator()
  .withClassName('Person')
  .withProperties({
    // night receptionist at a hotel
    name: "Nachtportier: (ca. 50%)",
  })
  .do();

await client.data.creator()
  .withClassName('Person')
  .withProperties({
    // Mandate manager for agricultural fiduciary services e.g. fiduciary with federal certificate or equivalent training
    name: "Mandatsleiter Agrotreuhand z.B. Treuhänder mit eidg. Fachausweis oder gleichwertiger Ausbildung",
  })
  .do();

// night receptionist at a hotel
const query = "Nachtportier";

const resImage = await client.graphql.get()
  .withClassName('Person')
  .withFields(['name _additional{distance certainty id}'])
  .withNearText({ concepts: [query] })
  .do();

console.log(resImage.data.Get.Person);
Executing this gives the following result:
[
  {
    _additional: {
      certainty: 0.8617309927940369,
      distance: 0.276538,
      id: '20b34dc7-d7d7-4d00-8c1d-93022960f224'
    },
    name: 'Mandatsleiter Agrotreuhand z.B. Treuhänder mit eidg. Fachausweis oder gleichwertiger Ausbildung'
  },
  {
    _additional: {
      certainty: 0.7965770363807678,
      distance: 0.40684593,
      id: '0b072dab-2c93-4189-b713-101fec7248b3'
    },
    name: 'Nachtportier: (ca. 50%)'
  }
]
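For reference when reading these numbers: as far as I understand, with the cosine distance metric Weaviate reports certainty as 1 - distance / 2, and the values above are consistent with that. The sketch below assumes cosine distance is the metric in use (I did not configure anything else); the helper names are hypothetical and not part of the Weaviate client.

```typescript
// Hypothetical helpers, not part of weaviate-ts-client.
// Assumption: the class uses cosine distance, where distance = 1 - cosine similarity.

// Certainty as derived from a cosine distance in the range [0, 2].
function certaintyFromCosineDistance(distance: number): number {
  return 1 - distance / 2;
}

// Plain cosine similarity recovered from the same distance.
function cosineSimilarityFromDistance(distance: number): number {
  return 1 - distance;
}

// Distances copied from the query result above.
const scored = [
  { name: 'Mandatsleiter Agrotreuhand ...', distance: 0.276538 },
  { name: 'Nachtportier: (ca. 50%)', distance: 0.40684593 },
];

for (const item of scored) {
  const certainty = certaintyFromCosineDistance(item.distance);
  const similarity = cosineSimilarityFromDistance(item.distance);
  console.log(item.name, certainty.toFixed(4), similarity.toFixed(4));
}
// Mandatsleiter Agrotreuhand ... 0.8617 0.7235
// Nachtportier: (ca. 50%) 0.7966 0.5932
```

So in terms of raw cosine similarity, the unrelated text sits at roughly 0.72 and the text containing the query word at roughly 0.59.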
Even though the first text has nothing to do with nights or hotels, it still scores much higher than the text that contains the literal query word. In the full application, many other texts that have nothing to do with the input query also score much higher than the second text, and the same happens with many other queries. The vectorizer sometimes works and suggests the right jobs, but there are also many, many cases like this one where it scores unrelated texts unreasonably high. Am I using the system wrong, or is the technology just not mature yet?