
I am trying to read an HTML document in R that contains some Vue.js script. The document has tags whose attribute names contain the @ symbol.

When I read the document using read_html in R, the attributes containing the @ symbol are not parsed correctly.

read_html("<html><title @click='method'>Hi</title></html>")
{xml_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>Hi</title>\n</head>

The whole @click attribute is missing from the title tag. Can someone please let me know how to read tag attributes containing the @ character?

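Another example with inconsistent behaviour. Here the @click attribute is dropped, but a stray gotooptions attribute taken from its value is left on the tag: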

read_html("<html><title @click='$vuetify.goTo(0, goToOptions)' id='scrollBtn' style='display:none;' v-scroll='scrollfun'>Hi</title></html>")
{xml_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title gotooptions id="scrollBtn" style="display:none;" v-scroll="scrollfun">Hi</title>\n</head>
  • I think the note in the `?read_html` page is relevant: "HTML is normalized to valid XML" and [XML has rules about names](https://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name) such that they need to start with a letter. So I don't think you'll be able to use `xml2` here. I'm not aware of any R package that does true HTML (not just XML-like) parsing. – MrFlick Mar 31 '19 at 23:23
  • Thank you @MrFlick for your input. I guess I am stuck then. On an unrelated note, do you happen to know of any true HTML parser in Python or Julia? I could try to learn a package from one of those languages and interface it with R to get better HTML parsing. – Rajesh Talluri Apr 01 '19 at 20:22
  • Maybe check here for something you can work with: https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers. I don't have any experience with alternative parsers. – MrFlick Apr 01 '19 at 20:43
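
On the Python question above, here is a minimal sketch, assuming Python 3's standard-library `html.parser` module is an acceptable tolerant parser. It is not something suggested in the thread, and the names `AttrCollector` and `start_tags` are hypothetical, chosen only for illustration. It records each start tag with its attributes, so an attribute such as `@click` can be inspected directly. How to call this from R is left open.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attributes) pairs for every start tag seen.

    Hypothetical helper class, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. Attribute names are
        # lowercased by the parser; a bare attribute has value None.
        self.tags.append((tag, dict(attrs)))


def start_tags(html):
    """Return a list of (tag, attribute dict) for each start tag in html."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


doc = "<html><title @click='method'>Hi</title></html>"
for tag, attrs in start_tags(doc):
    print(tag, attrs)
# html {}
# title {'@click': 'method'}
```

With the second example from the question, the same function should report `@click`, `id`, `style` and `v-scroll` on the title tag, with the `@click` value kept intact as `$vuetify.goTo(0, goToOptions)`. Note that this parser only reports tags as a stream of events; it does not build a document tree the way `read_html` does.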
