
We want to index our client's website and store all of its data in the IBM Watson Discovery service. When a user asks a question related to the client's data, the chatbot (we will connect Discovery with Watson Assistant) should query Discovery and use the retrieved data to respond.

Problem: The client website has multiple links, and each link has further links of its own. We want to crawl all the data from the website and index and store it in the Watson Discovery service. We tried crawling the site, but the Discovery service is taking a long time and still has not completed the task after one week. Please let us know how we can achieve this in a better and faster way.
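(For reference, a minimal sketch of the kind of query the chatbot backend would eventually run against the crawled collection, assuming the ibm-watson Python SDK. The API key, service URL, environment ID, and collection ID below are placeholders, not values from our setup.)

```python
# Minimal sketch: query the Discovery collection that the web crawl populates.
# All credentials and IDs below are placeholders.
from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_DISCOVERY_APIKEY")                          # placeholder
discovery = DiscoveryV1(version="2019-04-30", authenticator=authenticator)
discovery.set_service_url("https://api.us-south.discovery.watson.cloud.ibm.com")   # placeholder URL

# A natural-language query like the one Watson Assistant would forward.
response = discovery.query(
    environment_id="YOUR_ENVIRONMENT_ID",   # placeholder
    collection_id="YOUR_COLLECTION_ID",     # placeholder
    natural_language_query="opening hours of the support desk",
    count=3,                                # return the top 3 matching crawled pages
).get_result()

for result in response.get("results", []):
    # Web-crawled documents typically carry the page text plus extracted metadata.
    print(result.get("extracted_metadata", {}).get("title"), "-", result.get("text", "")[:120])
```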

user2319726
  • Are you trying out the (beta) web crawl? What are your settings, e.g., for hops? – data_henrik Jun 20 '19 at 07:42
  • We are trying with the Lite plan. It allows 1,000 documents, but the crawl is not able to export any data. – user2319726 Jun 20 '19 at 09:35
  • Please share in your question how you configured the crawl. How do you search? I am using the Lite plan and the beta web crawl in a similar scenario and it works. – data_henrik Jun 20 '19 at 10:25
  • @data_henrik, I tried the following steps: connect a data source --> web crawl --> added the URL --> save and sync. Please let me know: can we crawl a website with multiple links at a time by giving the parent URL alone? – user2319726 Jun 20 '19 at 10:53
  • How many hops have you configured? How many start URIs? – data_henrik Jun 20 '19 at 12:53
  • @data_henrik, I have given only one URL, and hops is the default number (2). – user2319726 Jun 21 '19 at 05:24

1 Answer


Note that the web crawl is currently a beta, and the Watson Discovery documentation for the web crawl states that, depending on the website, it will not ingest all data.

I use the web crawl in Discovery in a scenario similar to yours and query my website using a chatbot built with Watson Assistant. What you should do:

  • increase the number of hops: this determines how deep Watson Discovery crawls your website
  • depending on your website: add multiple entry points
  • specify all the paths that you want to exclude. I excluded those that would add duplicate entries and those that generate summary pages, RSS feeds, etc.
  • adjust how often it should crawl
  • check that Watson Discovery can access your website and that your website does not block crawling (a quick check is sketched below)
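For the last point, one quick sanity check is to read the site's robots.txt with the Python standard library's robot parser. This is just a sketch: the site URL is a placeholder, and the crawler's exact user agent string is an assumption, so the wildcard agent is checked as well.

```python
# Sketch: verify that robots.txt does not block crawling of the start URL.
from urllib.robotparser import RobotFileParser

SITE = "https://www.example-client-site.com"   # placeholder start URL
rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

for agent in ("*", "watson"):                  # "watson" is an assumed agent name
    print(f"user-agent {agent!r} may fetch {SITE}/ :", rp.can_fetch(agent, SITE + "/"))
```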
data_henrik
  • Thanks for your response. I will try the above options once again and check. – user2319726 Jun 21 '19 at 08:28
  • I am able to crawl the website data now, but the number of documents it crawled is high (1,000+) for a simple website. I want the website's data, but it is crawling other data as well. Is there an option to control the document count per page? I want to crawl 20 pages, all of which contain little data. I am on the free plan and want to get all the data with this plan only. – user2319726 Jun 26 '19 at 05:40
  • Please upvote and mark as answered. Like I said, you can exclude paths from being crawled, limit the hops, and have multiple entry points. – data_henrik Jun 26 '19 at 06:19
  • I need a few clarifications; please check below and let me know your thoughts. 1. How do I exclude paths? Say I have given the URL below, then what paths do I need to exclude? https://stackoverflow.com/questions/56681223/ibm-watson-discovery-crawling-issue 2. I tried with 0 hops, but it does not crawl any data; with 2 hops it crawled 1,000 documents for a simple page. 3. What do you mean by multiple entry points? – user2319726 Jun 26 '19 at 07:04
  • Stack Overflow is not for discussions. I would suggest joining the Watson Slack: http://wdc-slack-inviter.mybluemix.net/ – data_henrik Jun 26 '19 at 07:10
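Regarding the surprise in the comments that two hops produced 1,000+ documents: every page found at hop N contributes all of its own links at hop N+1, so the page count grows quickly. Below is a rough, standard-library-only sketch of that growth; the start URL is a placeholder, and this only estimates the crawl frontier, it is not how Discovery itself counts documents.

```python
# Sketch: estimate how many same-site pages a 2-hop crawl would reach.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def same_site_links(url, base_netloc):
    """Return the absolute links on `url` that stay on the same host."""
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except Exception:
        return []
    parser = LinkCollector()
    parser.feed(html)
    absolute = (urljoin(url, href) for href in parser.links)
    return [link for link in absolute if urlparse(link).netloc == base_netloc]

start = "https://www.example-client-site.com/"   # placeholder start URL
base_netloc = urlparse(start).netloc
seen, frontier = {start}, [start]
for hop in (1, 2):                               # 2 hops = the default mentioned in the question
    next_frontier = []
    for page in frontier:
        for link in same_site_links(page, base_netloc):
            if link not in seen:
                seen.add(link)
                next_frontier.append(link)
    print(f"after hop {hop}: {len(seen)} unique pages discovered")
    frontier = next_frontier
```

Running something like this against the real site shows how fast the page count grows per hop and which path prefixes are worth adding to the exclude list.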