
I'm using the Similarity plugin with org.apache.lucene.analysis.fr.FrenchAnalyzer and I get strange results when searching for terms close to a given word. Some of the candidate terms are truncated in odd ways, for example: pieuvre -> pieuvr, mobile -> mobil, chouette -> chouet, pattes -> pate, mieux -> mieu, and the 'central' word telephone -> telephon.

[Screenshot: similarity with the term 'téléphone']

Here is the small RDF example:

@prefix : <http://data.edf.fr/ontologies/ideeShaker#> .
@base <http://data.edf.fr/data/ideeShaker/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<1> a :Idee;
  :content "Le téléphone à pieuvre n’est pas mobile.";
  rdfs:label "Test1" .

<2> a :Idee;
  :content "Un tien vaut mieux que 2 téléphones";
  rdfs:label "Test2" .

<3> a :Idee;
  :content "La pieuvre a huit pattes";
  rdfs:label "Test3" .

<4> a :Idee;
  :content "La pieuvre et mobile Dick";
  rdfs:label "Test4" .

<5> a :Idee;
  :content "Pieuvre possède un portable";
  rdfs:label "Test5" .

<6> a :Idee;
  :content "Le téléphone à pieuvre c’est nul";
  rdfs:label "Test6" .

<7> a :Idee;
  :content "Le téléphone mobile c’est chouette";
  rdfs:label "Test7" .

<8> a :Idee;
  :content "Une pieuvre est mobile";
  rdfs:label "Test8" .

<9> a :Idee;
  :content "Une pieuvre et un téléphone";
  rdfs:label "Test9" .

<10> a :Idee;
  :content "Telephone à la pieuvre";
  rdfs:label "Test10" .

Do you think this problem is related to the Lucene analyzer or to the GraphDB integration?

Dharman

1 Answer


You could use the following index creation parameter:

-trainingcycles <number_of_training_cycles>

This will clear most of the unwanted results.

Note that this parameter must be set when the index is created.

Sava Savov
  • But I can't understand why this would avoid word truncation – Laurent Pierre Dec 01 '20 at 09:54
  • This won't avoid word truncation. It will remove echoes. This is how the underlying SemanticVectors library works. The most similar built vector is always the root of the word (for English). – Sava Savov Dec 02 '20 at 08:33
  • The exposed roots seem weird in French – Laurent Pierre Dec 02 '20 at 11:36
  • This is how the stemmer of Lucene's French analyzer (or of any other language's analyzer) works. The Lucene documentation describes how to handle this. – Sava Savov Dec 02 '20 at 14:59
  • If I understood correctly what I read at https://blog.trifork.com/2011/12/07/analysing-european-languages-with-lucene/, the problem comes from FrenchStemmer, which is too aggressive and has to be replaced by FrenchLightStemmer. I don't know which stemmer is embedded in GraphDB, and I don't know how to replace it if need be. – Laurent Pierre Dec 02 '20 at 16:11
  • The underlying SemanticVectors library uses exactly FrenchLightStemmer. Looking at the Similarity plugin codebase, there's no way to work around this. I'll raise an issue and it will be fixed in one of the following GraphDB releases. – Sava Savov Dec 02 '20 at 16:36
  • Thanks Sava, I look forward to this release ;) – Laurent Pierre Dec 03 '20 at 08:37
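
To illustrate the point made in the comments, that the terms shown by the index are stems rather than the original words, here is a minimal Python sketch. The function `toy_stem` is a hypothetical, deliberately simplified stand-in written for this illustration; it is not Lucene's FrenchLightStemmer and does not reproduce all of its rules (it will not give the same result for every word in the question). It only lowercases, removes accents, and strips a plural marker and a final "e", which is enough to show why several surface forms collapse into one truncated term.

```python
import unicodedata
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Simplified stand-in for a light stemmer (NOT Lucene's FrenchLightStemmer)."""
    # Lowercase, then drop diacritics: "Téléphone" -> "telephone"
    decomposed = unicodedata.normalize("NFD", word.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Strip a plural marker ("s" or "x"), then a final "e"
    if len(folded) > 3 and folded[-1] in "sx":
        folded = folded[:-1]
    if len(folded) > 3 and folded.endswith("e"):
        folded = folded[:-1]
    return folded


# Surface forms taken from the example data in the question
words = ["téléphone", "téléphones", "Telephone", "pieuvre", "Pieuvre", "mobile", "mieux"]

# Group the surface forms by the term a stemming index would store for them
forms_by_stem = defaultdict(list)
for w in words:
    forms_by_stem[toy_stem(w)].append(w)

for stem, forms in forms_by_stem.items():
    print(f"{stem}: {forms}")
```

With this toy rule set, the three spellings of "téléphone" in the sample data end up under a single key, which is the kind of truncated term the similarity results display. A display layer could keep such a stem-to-forms map to show readable words, but that is a workaround idea under these assumptions, not something the Similarity plugin is known to offer.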