I'm trying to extract the main body content from news sites & blogs.
The docs make it seem as though `documents.analyzeSyntax` would work as expected with HTML: you pass a document whose `content` is the page's raw HTML (UTF-8) and whose `type` is set to `HTML`. The docs definitely list HTML as a supported content type.
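For concreteness, here is roughly how I'm building the request, a minimal sketch against the v1 REST endpoint (the HTML snippet and the API-key placeholder are stand-ins, not my real inputs):

```python
import json

# Stand-in for a real page's raw markup.
html = "<html><body><p>Hello, <b>world</b>.</p></body></html>"

# Request body for documents:analyzeSyntax (v1 REST API).
body = {
    "document": {
        "type": "HTML",    # HTML is listed as a supported document type
        "content": html,   # the page's raw UTF-8 HTML
    },
    "encodingType": "UTF8",
}

# POSTed to:
# https://language.googleapis.com/v1/documents:analyzeSyntax?key=API_KEY
print(json.dumps(body, indent=2))
```

With this body, the sentences and tokens in the response still contain the raw tags.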
In practice, however, the resulting sentences and tokens are muddled with HTML tags, as though the parser treated the input as plain text. As it stands, this rules out the Google Cloud NL API for my use case, and presumably many others, since running NLP over web pages is a pretty common task.
For reference, here is an example by Dandelion API of the type of output one would expect given HTML input (or rather in this case a URL to an HTML page as input).
My question, then: am I missing something (perhaps invoking the API incorrectly), or does the NL API simply not support HTML input?