When I index a website into a Solr collection (created with the configSet sample_techproducts_configs) by URL, using the following command:

bin/post -p 8983 -c collection https://www.mywebsite.com -recursive 3 

the indexed documents have a content field, which is copied to the text field. This field holds the content of the web page as parsed by the embedded Tika parser.

But when a web page contains <script> or <style> tags, markup such as <body> is stripped, yet the scripts or styles inside those tags remain as part of the page content and show up in responses to Solr queries.

How can I remove this unwanted content?

S Jayesh

1 Answer

In SimplePostTool, read the input stream for DATA_MODE_WEB (only for pages whose content type is "text/html"), remove all <script> and <style> tags together with their content, and then convert the cleaned string back to a stream using stringToStream(String). This is done in the readPageFromUrl(URL u) function.
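The stripping step could look roughly like the sketch below. The class name HtmlCleaner and its two methods are hypothetical helpers, not part of SimplePostTool, and wiring them into readPageFromUrl(URL u) is left out. The sketch assumes the page is UTF-8 encoded and uses a regular expression rather than a real HTML parser, so it may miss malformed or unclosed tags.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: strips <script> and <style> elements.
final class HtmlCleaner {

    // (?i) ignores case, (?s) lets '.' match newlines.
    // \1 requires the closing tag to match the opening tag name.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Removes every <script>...</script> and <style>...</style> block,
    // including the text between the tags.
    static String stripScriptAndStyle(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    // Reads the whole stream (assumed UTF-8), cleans it, and returns a new
    // stream over the cleaned HTML. Requires Java 9+ for readAllBytes().
    static InputStream cleanHtmlStream(InputStream in) throws IOException {
        String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        String cleaned = stripScriptAndStyle(html);
        return new ByteArrayInputStream(cleaned.getBytes(StandardCharsets.UTF_8));
    }
}
```

With something like this in place, the stream read for a "text/html" page would be passed through cleanHtmlStream before being handed on for indexing, so Tika only sees the HTML without script and style bodies.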

S Jayesh