I am using Weaviate with the vectorizer semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-mpnet-base-v2, which is supposedly the best multilingual vectorizer available. I am using it to sort through German text:

import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
    scheme: 'http',
    host: 'localhost:8080',
});



const schemaConfig = {
    class: 'Person',
    properties: [
        {
            name: 'name',
            dataType: ['string'],
        }
    ]
};
try {
    await client.schema.classDeleter().withClassName('Person').do();
} catch (e) {
    console.log(e)
}

await client.schema
    .classCreator()
    .withClass(schemaConfig)
    .do();


const schemaRes = await client.schema.getter().do();



await client.data.creator()
  .withClassName('Person')
  .withProperties({
      // night receptionist at a hotel
      name: "Nachtportier: (ca. 50%)"
  })
  .do();

await client.data.creator()
  .withClassName('Person')
  .withProperties({
      //Mandate manager for agricultural fiduciary services e.g. fiduciary with federal certificate or equivalent training
      name:"Mandatsleiter Agrotreuhand z.B. Treuhänder mit eidg. Fachausweis oder gleichwertiger Ausbildung"
  })
  .do();


// night receptionist at a hotel
const query = "Nachtportier";

const resImage = await client.graphql.get()
  .withClassName('Person')
  .withFields(['name _additional{distance certainty id}'])
  .withNearText({concepts: [query]})
  .do();

console.log(resImage.data.Get.Person)

Executing this gives the following result:

[
  {
    _additional: {
      certainty: 0.8617309927940369,
      distance: 0.276538,
      id: '20b34dc7-d7d7-4d00-8c1d-93022960f224'
    },
    name: 'Mandatsleiter Agrotreuhand z.B. Treuhänder mit eidg. Fachausweis oder gleichwertiger Ausbildung'
  },
  {
    _additional: {
      certainty: 0.7965770363807678,
      distance: 0.40684593,
      id: '0b072dab-2c93-4189-b713-101fec7248b3'
    },
    name: 'Nachtportier: (ca. 50%)'
  }
]

Even though the first text has nothing to do with night or hotels, it still scores much higher than the text that contains the literal query word. In the full application, a lot of text that has nothing to do with the input query is also scored much higher than the second text, and the same happens with many other queries. The vectorizer sometimes works and suggests the right jobs, but there are also many, many cases like this one where it scores unrelated texts unreasonably high. Am I using the system wrong or is it just not mature technology yet?

user2741831

1 Answer

Am I using the system wrong or is it just not mature technology yet?

As you probably know, the quality of search results depends crucially on the vectorizer, rather than on Weaviate. Try another model for comparison, e.g. one from OpenAI?
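When comparing models, it may help to convert the numbers Weaviate returns into plain cosine similarity, since that is what most embedding libraries report when you score the same pair locally. The sketch below assumes the class uses the default cosine distance metric, where distance is 1 minus the cosine similarity and certainty is a rescaling of the distance into the 0 to 1 range. The helper names and the shortened labels are made up for illustration; the distances are the ones from the question.

```typescript
// Assumption: the class uses Weaviate's default cosine distance metric.
// Cosine distance = 1 - cosine similarity, so it ranges from 0 to 2.
function cosineSimilarityFromDistance(distance: number): number {
  return 1 - distance;
}

// Certainty rescales cosine distance into 0..1: certainty = 1 - distance / 2.
function certaintyFromDistance(distance: number): number {
  return 1 - distance / 2;
}

// Distances copied from the result in the question (labels shortened).
const reported = [
  { label: 'Mandatsleiter Agrotreuhand', distance: 0.276538 },
  { label: 'Nachtportier: (ca. 50%)', distance: 0.40684593 },
];

for (const r of reported) {
  const similarity = cosineSimilarityFromDistance(r.distance).toFixed(4);
  const certainty = certaintyFromDistance(r.distance).toFixed(4);
  console.log(`${r.label}: similarity=${similarity}, certainty=${certainty}`);
}
```

Seen this way, the two candidates have cosine similarities of roughly 0.72 and 0.59 to the query, which is the gap to look at when checking whether another model separates them better.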

Something else to try is a hybrid search, which can be configured to rank literal matches higher via its alpha parameter.
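A minimal sketch of such a hybrid query, reusing the `Person` class and `name` property from the question. In Weaviate, `alpha = 1` is pure vector search and `alpha = 0` is pure keyword (BM25) search, so a low value favors literal matches. The `hybridSearch` helper, the `ClientLike` and `GetBuilder` types, and the default `alpha` of 0.25 are assumptions for illustration; the types only describe the part of the weaviate-ts-client query builder the sketch calls, so that it does not depend on the package.

```typescript
// Hypothetical structural types covering only the builder methods used below.
interface GetBuilder {
  withClassName(name: string): GetBuilder;
  withFields(fields: string): GetBuilder;
  withHybrid(args: { query: string; alpha?: number }): GetBuilder;
  withLimit(limit: number): GetBuilder;
  do(): Promise<any>;
}

interface ClientLike {
  graphql: { get(): GetBuilder };
}

// alpha = 1 -> vector search only, alpha = 0 -> keyword (BM25) search only.
// The 0.25 default is an arbitrary starting point, not a recommended value.
async function hybridSearch(
  client: ClientLike,
  query: string,
  alpha: number = 0.25,
  limit: number = 5,
): Promise<any[]> {
  const res = await client.graphql
    .get()
    .withClassName('Person')
    // Hybrid results expose a fused score rather than a distance.
    .withFields('name _additional { score }')
    .withHybrid({ query, alpha })
    .withLimit(limit)
    .do();
  return res.data.Get.Person;
}
```

With the client from the question this would be called as `await hybridSearch(client, 'Nachtportier')`; depending on the client version, a type cast may be needed for the structural type to match. Trying a few values of `alpha` on queries with known good answers is a reasonable way to pick one.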

Dan Dascalescu
  • OpenAI is English-only; Cohere is multilingual but insanely expensive. But I will try that next. Even so, even a weak model should not be this terrible – user2741831 May 09 '23 at 17:28
  • I am having the same issue as @user2741831. I tried my own vectors with no success, then I tried the module "text2vec-transformers" without any changes. Surprisingly, with the same SentenceTransformers model I got very different dot product results from Weaviate, whereas dot_score from the SentenceTransformers utils gave a very good result for the distance between a string and a paragraph that contains that exact string – dc10 Jun 15 '23 at 13:37