How can I index HTML documents?

Question

I am using Lucene.NET for full-text search. Until now I have only been indexing PDF documents, but I now have a few web pages that I need to index as well. What is the best/easiest way to index HTML documents so they can be added to my Lucene index? I am using .NET/C#.

score 1 · Answer 1 · answered Mar 23 '10 at 09:57

I am currently working on this problem. The best answer I have found to date is to use the HTML Agility Pack to extract the plain-text content from the HTML.
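A minimal sketch of that approach, not a definitive implementation. It assumes the HtmlAgilityPack NuGet package and a Lucene.NET 2.9/3.0-style API (`Field.Store` / `Field.Index`); on Lucene.NET 4.8 the field types would be `StringField` and `TextField` instead. The class name `HtmlIndexer`, the method names, and the field names "url" and "content" are made up for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical helper: names are illustrative, not part of either library.
public static class HtmlIndexer
{
    // Returns the text content of an HTML string with markup removed.
    public static string ExtractText(string html)
    {
        var page = new HtmlDocument();
        page.LoadHtml(html);

        // Drop script, style, and comment nodes so their contents are not indexed.
        // SelectNodes returns null when nothing matches.
        var unwanted = page.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (unwanted != null)
        {
            // Copy to a list first so removal does not disturb the enumeration.
            foreach (var node in unwanted.ToList())
                node.Remove();
        }

        // Decode entities such as &amp; and collapse runs of whitespace.
        string text = HtmlEntity.DeEntitize(page.DocumentNode.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Adds one page to an existing index (assumes the Lucene.NET 2.9/3.0 Field API).
    public static void AddPage(IndexWriter writer, string url, string html)
    {
        var doc = new Document();
        // Store the URL verbatim so search hits can link back to the page.
        doc.Add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Analyze the extracted text for full-text search; it is not stored.
        doc.Add(new Field("content", ExtractText(html), Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }
}
```

The `IndexWriter` passed to `AddPage` can be the same one already used for the PDF documents, provided the field names match the ones your queries search.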

Adam Pope


score -3 · Answer 2 · answered Dec 17 '09 at 02:01

Google can index your content for you.

Pierreten


Not only does the asker *specifically* state that they are using Lucene.NET, but even if using Google were an option, this answer doesn't contain any real information on how to achieve it. – Justin Feb 10 '11 at 06:57
