I index a URL into a Solr collection (created with the configSet sample_techproducts_configs) using the following command:
bin/post -p 8983 -c collection https://www.mywebsite.com -recursive 3
The resulting documents have a content field, which is copied to the text field.
This field holds the content of the web page as parsed by the embedded Tika parser.
However, when a web page contains a <script> or <style> tag, the <body> markup is removed, but the script or style code inside those tags remains in the content of the web page and shows up in responses to Solr queries.
How can I remove this unwanted content?
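To illustrate the result I am after, here is a minimal sketch in plain Python (standard library only, running outside Solr and Tika). It is not how the embedded Tika parser works; the class name `VisibleTextExtractor`, the helper `visible_text`, and the sample HTML are made up for illustration. It keeps the page text and drops whatever sits inside script and style elements:

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects page text, skipping <script>/<style> bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string without script or style content."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page, only for illustration.
sample = (
    "<html><head><style>body { color: red; }</style></head>"
    "<body><h1>Products</h1><script>var x = 1;</script>"
    "<p>Hello Solr</p></body></html>"
)
print(visible_text(sample))
```

Ideally the content field indexed by Solr would contain only this kind of visible text, without my having to pre-process every page myself.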