I'm using the Similarity plugin with org.apache.lucene.analysis.fr.FrenchAnalyzer, and I get strange results when searching for terms close to a given word. Some of the candidate terms come back in an oddly truncated form, for example:
pieuvre -> pieuvr, mobile -> mobil, chouette -> chouet, pattes -> pate, mieux -> mieu and the 'central' word telephone -> telephon.
These results come from a similarity search on the term 'téléphone'.
Here is the small RDF example I used:
@prefix : <http://data.edf.fr/ontologies/ideeShaker#> .
@base <http://data.edf.fr/data/ideeShaker/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<1> a :Idee;
:content "Le téléphone à pieuvre n’est pas mobile.";
rdfs:label "Test1" .
<2> a :Idee;
:content "Un tien vaut mieux que 2 téléphones";
rdfs:label "Test2" .
<3> a :Idee;
:content "La pieuvre a huit pattes";
rdfs:label "Test3" .
<4> a :Idee;
:content "La pieuvre et mobile Dick";
rdfs:label "Test4" .
<5> a :Idee;
:content "Pieuvre possède un portable";
rdfs:label "Test5" .
<6> a :Idee;
:content "Le téléphone à pieuvre c’est nul";
rdfs:label "Test6" .
<7> a :Idee;
:content "Le téléphone mobile c’est chouette";
rdfs:label "Test7" .
<8> a :Idee;
:content "Une pieuvre est mobile";
rdfs:label "Test8" .
<9> a :Idee;
:content "Une pieuvre et un téléphone";
rdfs:label "Test9" .
<10> a :Idee;
:content "Telephone à la pieuvre";
rdfs:label "Test10" .
Do you think this problem is related to the Lucene analyzer or to the GraphDB integration?