
I'm using lxml to sanitize HTML data, but in some cases lxml also removes valid tags. It removes iframe tags whose src points to a whitelisted host but starts with double slashes (//), i.e. a protocol-relative URL.

Code example:

>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'])
>>> iframe = '<iframe src="//www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
>>> cleaner.clean_html(iframe)
'<div></div>'

But for URLs with an explicit scheme (no leading double slashes) it works fine:

>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'])
>>> iframe = '<iframe src="https://www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
>>> cleaner.clean_html(iframe)
'<iframe src="https://www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'

What do I have to do to make lxml recognize that this is a valid URL?

Thanks.

user3164429

1 Answer


If you look at the docs for Cleaner (http://lxml.de/3.4/api/lxml.html.clean.Cleaner-class.html), it appears that by default these parameters are set to True:

embedded:
    Removes any embedded objects (flash, iframes)
frames:
    Removes any frame-related tags

So my first instinct would be to try cleaner = Cleaner(host_whitelist=['www.youtube.com'], embedded=False)

AutomaticStatic
  • The docs there also say: "whitelist_tags: A set of tags that can be included with host_whitelist. **The default is iframe and embed**; you may wish to include other tags like script, or you may want to implement allow_embedded_url for more control. Set to None to include all tags." My example also shows that the iframe is kept when the scheme (https) is included in the URL, so this is not related to the "embedded" argument. – user3164429 Nov 19 '16 at 22:34
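
A possible explanation and workaround, offered as a sketch rather than a confirmed fix: as far as I can tell from the lxml source, allow_embedded_url splits the URL and only accepts the schemes http and https before it checks host_whitelist. A protocol-relative URL such as //www.youtube.com/... has an empty scheme, so it would be rejected even though the host is whitelisted. The snippet below uses only the standard library to show the empty scheme and to prepend one. The helper name add_default_scheme and the choice of https as the default are my own assumptions, not part of lxml.

```python
from urllib.parse import urlsplit, urlunsplit


def add_default_scheme(url, scheme="https"):
    """Prepend a scheme to a protocol-relative URL (//host/path).

    Hypothetical helper for illustration; URLs that already have a scheme,
    and URLs without a host (e.g. /local/path), are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.scheme and parts.netloc:
        # Keep netloc, path, query and fragment; only fill in the scheme.
        return urlunsplit((scheme,) + tuple(parts[1:]))
    return url


relative = "//www.youtube.com/embed/S2S5I5GHkDQ"

# The host is parsed correctly, but the scheme is an empty string.
print(repr(urlsplit(relative).scheme))
print(urlsplit(relative).netloc)

# After normalization the URL has the same form as the one that works above.
print(add_default_scheme(relative))
```

One way to use this would be to rewrite the src attributes of iframes with the helper before calling clean_html. Alternatively, a Cleaner subclass could override allow_embedded_url (the hook the docs mention) and normalize the URL before delegating to the parent implementation. I have not verified either approach against every lxml version.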