I am facing the problem of full-text search across a large number of Arabic documents (PDF and Doc files) in C# .NET.
After a lot of searching, I came up with two candidate solutions.
First, Lucene.Net, where I ran into the following issues:
1- I need an Arabic analyzer to use with Lucene.Net. I found this, but I don't know yet whether it will work!
2- I need to extract the text from the documents (about 6000 PDF and Doc files). I found Tika, which I would use in .NET with the help of IKVM. However, even assuming this approach works, I don't know what the performance will be like.
Second, Xapian. I looked at it in order to make use of the Omega library, but I still have an open question:
1- Will Xapian work with Arabic content, or will it need an Arabic analyzer too? If so, how can I work around this problem?
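To make clear what I mean by an "Arabic analyzer": as far as I understand, the minimum I need is normalization of the text before indexing and before querying, so that spelling variants of the same word match. Below is a minimal sketch of that idea, written in Python only for brevity (the same mapping should port directly to C#). The function name `normalize_arabic` and the exact set of folded characters are my own assumptions, not the behavior of any particular library, and the sketch does no stemming at all.

```python
import re

# Arabic diacritics (tashkeel), fathatan U+064B through sukun U+0652.
_DIACRITICS = re.compile("[\u064B-\u0652]")

# Tatweel (kashida) is a purely typographic stretching character.
_TATWEEL = "\u0640"

# Letter variants folded to a single form (an assumed, commonly used set).
_FOLD = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
})


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then fold common letter variants."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_FOLD)
```

The same function would have to be applied to both the extracted document text and the user's query, otherwise a normalized index would not match an unnormalized search term.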
As it stands, I can't decide which solution to go with, given the Arabic content and the fairly large amount of data.
Any help or suggestions would be much appreciated.
Thanks,
Samer