I need to clean up HTML5 pages inside my Java project.

So I need a Java library, or a command-line program that works on both Linux and Windows.

JTidy doesn't work well (I tested it). HTML Tidy for HTML5 is a C++ library, and its command-line version works only on Linux.

Do you know whether the Validator.nu HTML Parser also cleans up pages? I didn't find any information about it.

Do you have any ideas?

Thanks

1 Answer

Use jsoup. It is well supported, has no native components (so it should run everywhere Java does), and is free under a very liberal license. It also supports HTML5.
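As a rough illustration of the tidy-style use case from the question, here is a minimal sketch that parses a messy page with jsoup and writes it back out as a normalized document. The class name `TidyWithJsoup`, the helper `tidy`, and the sample markup are made up for this example, and it assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TidyWithJsoup {

    // Hypothetical helper: parse possibly malformed HTML and return
    // jsoup's serialization of the resulting document tree.
    static String tidy(String html) {
        // Jsoup.parse builds a full document (html, head, body) and
        // closes any elements that were left open in the input.
        Document doc = Jsoup.parse(html);
        return doc.html();
    }

    public static void main(String[] args) {
        // Sample input with unclosed <p>, <section> and <b> elements.
        String messy = "<title>Demo</title><p>Unclosed paragraph<section><b>bold text";
        System.out.println(tidy(messy));
    }
}
```

Because the whole document is parsed, head elements such as `<title>` are kept, and no tag list has to be maintained. Whether the result is "clean enough" depends on what kind of cleanup you need.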

tucuxi
  • I tried to use jsoup and the clean() method, but I didn't understand whether I have to manually add ALL the HTML5 tags to the Whitelist object (gulp!), or whether there's another way to clean the page... – Antonio Giovanni Schiavone Jul 12 '12 at 18:19
  • It depends on your requirements (see comment above). Just tweaking the default Whitelist.relaxed() may be enough, for example. It covers most simple HTML. – tucuxi Jul 13 '12 at 09:44
  • The "relaxed" whitelist works fine for the body tags, but I didn't figure out how to add the head tags... – Antonio Giovanni Schiavone Jul 13 '12 at 09:59
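For the approach discussed in the comments, a minimal sketch of extending the relaxed list with a few HTML5 sectioning tags might look like the following. It is written against recent jsoup releases, where the class is named `Safelist`; in the releases from the time of this thread the same methods live on `Whitelist`. The class name `CleanWithRelaxedList`, the helper `clean`, and the choice of extra tags are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanWithRelaxedList {

    // Hypothetical helper: sanitize a body fragment, allowing the relaxed
    // defaults plus a handful of HTML5 sectioning elements.
    static String clean(String bodyHtml) {
        Safelist allowed = Safelist.relaxed()
                // The tags added here are an illustrative selection, not a complete HTML5 list.
                .addTags("section", "article", "header", "footer", "nav", "figure", "figcaption")
                // Allow an id attribute on <section> as an example of per-tag attributes.
                .addAttributes("section", "id");
        // Elements outside the list (for example <script>) are dropped.
        return Jsoup.clean(bodyHtml, allowed);
    }

    public static void main(String[] args) {
        String dirty = "<section id=\"intro\"><p>Hi</p><script>alert(1)</script></section>";
        System.out.println(clean(dirty));
    }
}
```

Note that `Jsoup.clean` treats its input as a body fragment, which is consistent with the difficulty of keeping head tags mentioned above. It is a sanitizer for untrusted markup rather than a tidy tool, so for whole pages parsing the document and re-serializing it may be the better fit.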