
I've been using HTML Tidy to convert ordinary HTML pages into XHTML suitable for parsing. The problem is that the test page I saved from Firefox apparently had its HTML partly cleaned up by Firefox during saving; call this file F. HTML Tidy works fine on file F but fails on the raw data written to a file via .NET (file N), complaining about form tags being intermixed with table tags. The HTML isn't mine, so I can't just fix the source.

How do I clean up file N enough that it can be run through HTML Tidy? Is there a standard way of hooking into Firefox (completely programmatically, without using the mouse or keyboard), or another tool that will apply extra fixes to the HTML?

Peter Smith

2 Answers


I had been using HTML Tidy for some time, but then found that I was getting better results from TagSoup.

It can be used as a JAXP parser, converting HTML that is not well-formed on the fly. I usually let it parse the input for Saxon XQuery transformations.

But it can also be used as a stand-alone utility, as an executable jar.
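
The JAXP route mentioned above could look roughly like the following. This is a minimal sketch, not a definitive setup: it assumes TagSoup 1.2 or later is on the runtime classpath and that its JAXP factory class is named org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl. The class name TagSoupSketch, the helper names newReader and toXml, and the jar name in the usage message are made up for illustration. The factory is looked up by name, so the code compiles without the jar.

```java
import java.io.StringWriter;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

final class TagSoupSketch {
    // Assumption: TagSoup's JAXP factory class (TagSoup 1.2+), present at runtime.
    static final String TAGSOUP_FACTORY = "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl";

    // Obtain a SAX XMLReader from the named JAXP factory implementation.
    static XMLReader newReader(String factoryClassName) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance(factoryClassName, null);
        return factory.newSAXParser().getXMLReader();
    }

    // Run an identity transform: the SAX events produced by the reader are
    // serialized as XML text, with no stylesheet involved.
    static String toXml(XMLReader reader, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        identity.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            // tagsoup.jar is a placeholder for whatever the downloaded jar is called.
            System.err.println("usage: java -cp tagsoup.jar:. TagSoupSketch <html-file>");
            return;
        }
        // Read the raw HTML (file N in the question) through TagSoup.
        InputSource input = new InputSource(Paths.get(args[0]).toUri().toString());
        System.out.println(toXml(newReader(TAGSOUP_FACTORY), input));
    }
}
```

Because toXml accepts any XMLReader, the same pipeline can be tried with the JDK's own parser on well-formed input before TagSoup is swapped in for the messy pages.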

Gunther

I wound up using SendKeys in C#, importing functions from user32.dll to make Firefox the active window after launching it on the page I wanted (file:///myfilepathhere/).

SendKeys seemed to require a windowed program, so I also added another executable that performs the actions in its form_load() method.

By sending Alt+F, Down six times, and Enter, waiting a bit, typing the full path and file name, pressing Enter twice, and then killing Firefox, I was able to automate Firefox's ability to clean up some of the HTML.

Peter Smith