I could use some advice on how to handle a particular cross-language search with Solr. I have documents in three languages (English, German, French). For simplicity, let's assume there are just two (English and German). The documents are standardised in the sense that they contain the same parts (text_part1 and text_part2); only the language they are written in differs. The language of each document is known. My index schema uses one core with separate fields for each language.
For a German document the index will look something like this:
- text_part1_en: empty
- text_part2_en: empty
- text_part1_de: German text
- text_part2_de: Another German text
For an English document it will be the other way around.
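To make the layout concrete, here is a minimal Python sketch of how such documents could be assembled before indexing. The helper `make_doc` and the `id` field are hypothetical names used only for illustration, and the sketch assumes that "empty" means the other language's fields are simply left out of the document.

```python
# Hypothetical helper: build the Solr document for one standardised text.
# Field names follow the schema described above; "id" is an assumed unique key.
LANGUAGES = ("en", "de")


def make_doc(doc_id, lang, part1, part2):
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language: {lang}")
    # Only the fields of the document's own language are filled;
    # the other language's fields are left out (assumption: absent == empty).
    return {
        "id": doc_id,
        f"text_part1_{lang}": part1,
        f"text_part2_{lang}": part2,
    }


german_doc = make_doc("doc-1", "de", "German text", "Another German text")
english_doc = make_doc("doc-2", "en", "English text", "Another English text")
```

Each document ends up with exactly two populated text fields, and the field suffix alone tells which language it belongs to.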
What I want to achieve: A user entering a query in English should receive both English and German documents that are relevant to their search. Further conditions are:
- I want results with hits in both text_part1 and text_part2 to rank higher than results with hits in only one field (tie value > 0).
- The queries will not be single words, but full sentences (stop word removal needed and partial hits [only a few words out of the sentences] must be valid).
- English and German documents must end up in a single ranking. I need to be able to compare the relevance of an English document to the relevance of a German document.
- The text parts need to stay separate, because I want to boost the importance of one part (let's say part1) over the other.
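As a rough sketch of how I picture these conditions mapping onto edismax request parameters: `qf` carries the per-field boosts, `tie` lets hits in additional fields contribute to the score, and `mm` allows partial matches of a sentence. The concrete values (a boost of 2.0, a tie of 0.1, an `mm` of 30%) and the helper name `edismax_params` are assumptions chosen only for illustration, and stop word removal is assumed to happen in each field's analyzer rather than in the request.

```python
from urllib.parse import urlencode


def edismax_params(query, part1_boost=2.0, tie=0.1, mm="30%"):
    """Build edismax parameters for the four language-specific fields.

    All default values are illustrative assumptions, not tuned settings.
    """
    fields = []
    for lang in ("en", "de"):
        # part1 gets a boost so that it counts more than part2.
        fields.append(f"text_part1_{lang}^{part1_boost}")
        fields.append(f"text_part2_{lang}")
    return {
        "defType": "edismax",
        "q": query,
        "qf": " ".join(fields),
        "tie": str(tie),  # > 0 so hits in several fields add up
        "mm": mm,         # only a share of the query terms has to match
    }


params = edismax_params("A sentence in English")
query_string = urlencode(params)
```

This covers the ranking conditions, but note that it still sends one and the same query text to all four fields.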
My general approach so far has been to get a German translation of the user's query by sending it to a translation API. Then I want to use an edismax query, since it seems to fulfill all of my requirements. The problem is that I cannot manage to search for the German query in the German fields only and the English query in the English fields only. The Solr edismax documentation states that it supports the full Lucene query parser syntax, but I can't find a way to address different fields with different inputs. I tried:
q=text_part1_en: (A sentence in English) text_part1_de: (Ein Satz auf Deutsch) text_part2_en: (A sentence in English) text_part2_de: (Ein Satz auf Deutsch)
qf=text_part1_en text_part2_en text_part1_de text_part2_de
This syntax should be in line with what MatsLindh wrote in this thread. I tried different ways of writing this q, but whatever I do, Solr always searches for the full q string in all four fields given by qf, which totally messes up the result. Am I just making mistakes in the query syntax, or is it not possible at all to do what I'm trying to do using edismax?
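For reproducibility, this is roughly how the request from my attempt can be rebuilt in Python. The helper `fielded_query`, the host `localhost:8983`, and the core name `mycore` are placeholders, not my real setup. The clause format, including the space after the colon, is copied verbatim from the q shown above, and the snippet only builds the URL without sending it.

```python
from urllib.parse import urlencode


def fielded_query(sentences_by_lang, parts=("text_part1", "text_part2")):
    """Join one 'field: (sentence)' clause per part and language."""
    clauses = []
    for part in parts:
        for lang, sentence in sentences_by_lang.items():
            # Format copied from the attempted q, space after the colon included.
            clauses.append(f"{part}_{lang}: ({sentence})")
    return " ".join(clauses)


q = fielded_query({"en": "A sentence in English", "de": "Ein Satz auf Deutsch"})
params = {
    "defType": "edismax",
    "q": q,
    "qf": "text_part1_en text_part2_en text_part1_de text_part2_de",
}
# Placeholder host and core name; nothing is sent over the network here.
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params)
```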
The only alternative I see is to run two separate edismax searches, one in English and one in German. But then I don't know how to combine the results. From what I understand, the scores from two different searches are not comparable, correct?
The sources about multilingual search I have found all seem to deal with the case in which the language of the query is unknown and needs to be detected, after which only documents in the query's language are relevant for the results. That said, it is entirely possible that I don't know exactly what to look for due to a lack of understanding. I'm very new to Solr. Any help is much appreciated. I'm using Solr 8.2.0.