I'm using lxml to sanitize HTML, but in some cases it also removes valid tags. Specifically, it removes iframe tags whose src points to a whitelisted host but starts with a double slash (//), i.e. a protocol-relative URL.
Code example:
>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'])
>>> iframe = '<iframe src="//www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
>>> cleaner.clean_html(iframe)
'<div></div>'
But for normal URLs (with an explicit scheme instead of the leading double slash) it works fine:
>>> cleaner = Cleaner(host_whitelist=['www.youtube.com'])
>>> iframe = '<iframe src="https://www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
>>> cleaner.clean_html(iframe)
'<iframe src="https://www.youtube.com/embed/S2S5I5GHkDQ"></iframe>'
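The only difference I can see between the two inputs is the scheme. Here is a quick check with the standard library's `urlsplit` (not lxml itself; that the cleaner looks at the scheme in a similar way is only my assumption). It shows that the protocol-relative URL has the expected host but an empty scheme:

```python
from urllib.parse import urlsplit

# The two src values from the examples above.
relative = urlsplit("//www.youtube.com/embed/S2S5I5GHkDQ")
absolute = urlsplit("https://www.youtube.com/embed/S2S5I5GHkDQ")

# Both resolve to the same host; only the scheme differs.
print(repr(relative.scheme), relative.netloc)  # '' www.youtube.com
print(repr(absolute.scheme), absolute.netloc)  # 'https' www.youtube.com
```

So the host itself is parsed correctly in both cases, which makes me think the missing scheme is what gets the tag rejected.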
What do I have to do to make lxml understand that this is a valid URL?
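One workaround I'm considering is to rewrite protocol-relative src values to an absolute URL before cleaning. Below is a minimal sketch using only the standard library; `ensure_scheme` is a hypothetical helper of my own, not part of lxml, and defaulting to https is an assumption:

```python
from urllib.parse import urlsplit

def ensure_scheme(url, default_scheme="https"):
    """Prefix protocol-relative URLs (//host/path) with a default scheme.

    Hypothetical helper, not an lxml API. URLs that already have a scheme,
    and plain relative paths without a host, are returned unchanged.
    """
    url = url.strip()
    parts = urlsplit(url)
    if not parts.scheme and parts.netloc:
        return f"{default_scheme}:{url}"
    return url

print(ensure_scheme("//www.youtube.com/embed/S2S5I5GHkDQ"))
```

Each iframe src would have to be passed through this helper before calling clean_html, so I would prefer a solution built into lxml if there is one.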
Thanks.