Can anyone suggest a way to index a CHM file, similar to how PDFBox is used for PDF files?
Apache Tika is more commonly used with Lucene; I just didn't know about its support for CHM. So please accept deathy's answer. – ffriend Jun 13 '11 at 14:13
2 Answers
If you're talking about Microsoft Compiled HTML Help files, you can extract the text from them with JChm and then index it in the usual way.
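As a rough illustration of the "extract text, then index it" step, here is a minimal sketch in plain Java. It assumes the HTML pages have already been pulled out of the CHM file (with JChm or another extractor; that call is deliberately not shown, since its exact API is not covered here). The class name `ChmTextIndex`, its methods, and the sample pages are hypothetical, and the small in-memory inverted index is only a stand-in for a real indexer such as Lucene.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: turns extracted CHM pages (HTML) into a tiny searchable index.
public class ChmTextIndex {
    // Maps each lowercase term to the names of the pages that contain it.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Crude HTML-to-text conversion: drops script/style bodies, tags, and a few entities.
    static String stripHtml(String html) {
        String text = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        text = text.replaceAll("(?s)<[^>]*>", " ");
        text = text.replace("&nbsp;", " ").replace("&lt;", "<")
                   .replace("&gt;", ">").replace("&amp;", "&");
        return text.replaceAll("\\s+", " ").trim();
    }

    // Tokenizes the page text on non-alphanumeric characters and records each term.
    void addPage(String pageName, String html) {
        String text = stripHtml(html).toLowerCase(Locale.ROOT);
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageName);
            }
        }
    }

    // Returns the pages containing the given term (case-insensitive, exact match).
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        ChmTextIndex index = new ChmTextIndex();
        // Sample pages standing in for content extracted from a CHM file.
        index.addPage("intro.htm", "<html><body><h1>Getting Started</h1><p>Install the tool.</p></body></html>");
        index.addPage("faq.htm", "<html><body><p>How do I install plugins?</p></body></html>");
        System.out.println(index.search("install")); // prints [faq.htm, intro.htm]
        System.out.println(index.search("plugins")); // prints [faq.htm]
    }
}
```

In a real setup, the stripped text of each page would be handed to the indexer as one document instead of being stored in the map above.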

ffriend
Be careful. There might be a binary and a textual (.xml stored as .hhk) index, and they might not contain the same things. – Marco van de Voort Jun 11 '11 at 20:06
I used ChmParser and its file-retrieval method, with a small workaround. It seems to work well, and the .hhc issue is resolved. Thanks again. – Biswanath Chowdhury Jun 13 '11 at 11:47
If you also have other document formats to index, you might find a better and more general solution in Apache Tika.
They recently added a CHM parser (for reference: Support of CHM Format), and it will be in the next version.

Cristian Vat