Is there a way to index CHM files in Lucene?

Question

Can anyone suggest a way to index a CHM file in Lucene, similar to how PDFBox is used for PDF?

Apache Tika is more commonly used with Lucene; I just didn't know it supported CHM. So please accept deathy's answer. — ffriend, Jun 13 '11 at 14:13

score 3 · Answer 1 · answered Jun 10 '11 at 13:53

If you're talking about Microsoft Compiled HTML Help files, you can just extract text from them with JChm and then index it in a normal way.
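Since the pages inside a CHM file are HTML, the "extract text" step usually ends with reducing each page to plain text before it goes into the index. Below is a minimal sketch of that step only, using just the Java standard library. It assumes the HTML of each page has already been pulled out of the CHM by JChm or a similar tool (that API is not shown here). The class and method names `ChmPageText` and `htmlToText` are made up for illustration, and the regex-based tag stripping is deliberately rough; a real HTML parser would be more robust.

```java
// Hypothetical helper: turns one HTML page extracted from a CHM file into plain text.
final class ChmPageText {

    // Rough, regex-based HTML-to-text conversion (an assumption, not part of JChm or Lucene).
    static String htmlToText(String html) {
        // Drop <script> and <style> blocks together with their content.
        String s = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        // Drop HTML comments.
        s = s.replaceAll("(?s)<!--.*?-->", " ");
        // Replace every remaining tag with a space so adjacent words do not merge.
        s = s.replaceAll("(?s)<[^>]*>", " ");
        // Decode a handful of common entities; &amp; goes last to avoid double decoding.
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&quot;", "\"")
             .replace("&amp;", "&");
        // Collapse runs of whitespace into single spaces.
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Help</title><style>p{color:red}</style></head>"
                + "<body><p>Open a <b>file</b> &amp; save it.</p></body></html>";
        System.out.println(htmlToText(page)); // prints: Help Open a file & save it.
    }
}
```

The resulting string can then be added to a Lucene document as an analyzed text field, typically next to a stored field holding the page's path inside the CHM, so that search hits can be traced back to a page.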

ffriend

Be careful. There might be a binary and a textual (.xml stored as .hhk) index, and they might not contain the same things. – Marco van de Voort Jun 11 '11 at 20:06
I used ChmParser and its file retrieval, with a small workaround. It seems to work well, and the .hhc issue is resolved. Thanks again. – Biswanath Chowdhury Jun 13 '11 at 11:47

score 3 · Accepted Answer · answered Jun 10 '11 at 16:06

If you also need to index other document formats, Apache Tika may be a better, more general solution.

They just added a CHM Parser recently (for reference: Support of CHM Format) and it will be in the next version.

Cristian Vat
