What HTML parser should I use?

Question

I am working on a product that needs to parse HTML documents. I have looked at Jericho, TagSoup, Jsoup and Crawl4J. Which parser should I use, given that the parsing will run in a multi-threaded environment scheduled with Quartz?

If 10 threads are running at the same time, I need an API with a low memory footprint. I read somewhere that Jericho is a text-based search API and consumes less memory. Is that right? Or should I go with one of the others, and why?

score 2 · Accepted Answer · answered Sep 11 '12 at 11:41

Test them out and check their memory footprint. It's hard to make predictions on memory profiles without knowing and testing the HTML you're going to parse.
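One rough way to "test them out" is to compare used heap before and after a parse. The sketch below is only an approximation under stated assumptions: `FootprintProbe`, `usedHeap` and `measure` are hypothetical names, the regex split in `main` is a stand-in for a real parser call, and `System.gc()` is only a hint to the JVM, so the numbers are noisy. A heap profiler will give more reliable figures.

```java
import java.lang.ref.Reference;
import java.util.function.Function;

// Hypothetical helper, not part of any parser library.
class FootprintProbe {

    // Heap currently in use by the JVM, in bytes.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Runs the given parse function once and returns the approximate change
    // in used heap while the parsed result is still reachable.
    static long measure(Function<String, ?> parser, String html) {
        System.gc();                         // a hint only; the JVM may ignore it
        long before = usedHeap();
        Object parsed = parser.apply(html);
        System.gc();
        long after = usedHeap();
        Reference.reachabilityFence(parsed); // keep the result alive until measured
        return after - before;
    }

    public static void main(String[] args) {
        // Synthetic input; use documents representative of your real HTML instead.
        String html = "<html><body>" + "<p>row</p>".repeat(10_000) + "</body></html>";

        // Stand-in for a real parser: replace the lambda with a call to the
        // library under test (Jsoup, Jericho, TagSoup, ...).
        long delta = measure(s -> s.split("(?=<)"), html);
        System.out.println("approx. heap delta in bytes: " + delta);
    }
}
```

Run it several times per library and per sample document, since a single reading can be skewed by garbage collection timing, and multiply by the number of concurrent threads to estimate peak usage.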

FWIW, I've used Jsoup in a number of different systems and I find that it works really well. I have never noticed any rampant memory issues with it either.

score 0 · Answer 2 · edited May 23 '17 at 12:19

I'm using Jsoup and I'm very impressed. It's wicked fast at parsing, and its CSS-style pattern matching of content is much easier to maintain than XPath.

I tried Validator.nu's parser first, and found it very lacking. The documentation is very thin and I couldn't get it to properly execute XPaths that worked fine in Chrome.

Also, check out this question: Which HTML Parser is the best?

answered Jan 03 '14 at 18:28

Ryan Shillington
