How do we get the document file url using the Watson Discovery Service?

Question

I can't find a way to do this in the available API documentation.

The URL is not shown in the web console either.

Is it possible to get the file url using the Watson Discovery Service?

Can you clarify precisely what you mean by the "file" URL? The query API has a GET method that probably gets what you want: https://watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Queries/get_v1_environments_environment_id_collections_collection_id_query – Colin Dean, Jan 30 '17 at 14:18
I want to get a document from the collection: the "text" field returned in the response, along with the URL of the document containing that text. – johnrao07, Jan 30 '17 at 15:42
The query GET response has a "results" array containing objects that have a "text" attribute containing the original text or an "html" attribute containing the converter output. Do you mean the _original_ URL of the document? – Colin Dean, Jan 30 '17 at 16:38
Yes, the original URL of the uploaded document. I learned from Anish that this is not possible, so I found a workaround. – johnrao07, Jan 30 '17 at 17:04

score 3 · Accepted Answer · answered Jan 30 '17 at 16:35

If you need to store the original source/file URL, you can include it as a field within your documents in the Discovery service, then you will be able to query that field back out when needed.
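As a minimal sketch of this idea, assuming the document is ingested as JSON: the helper below adds the original URL as an extra field before upload. The field name source_url and the helper with_source_url are illustrative choices, not part of the Discovery API, and the sample document content is made up.

```python
import json


def with_source_url(document: dict, source_url: str) -> str:
    """Return the document as a JSON string with an added source_url field."""
    enriched = dict(document)  # copy so the caller's dict is left untouched
    enriched["source_url"] = source_url  # field name is an assumption
    return json.dumps(enriched)


# Hypothetical document body; only the URL field matters for this sketch.
doc_json = with_source_url(
    {"title": "Report 1233", "text": "Body text of the document."},
    "http://example.com/files/1233.docx",
)
print(doc_json)
```

The resulting JSON could then be uploaded as the document, so that the source_url field comes back with the other fields in query results.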

Anish

Can you give an example of how to include a URL, say http://example.com/files/1233.docx, in the document itself? – Sanjay Kumar N S Jan 10 '18 at 07:25
I agree with Sanjay; an example would be great for clearing up doubts for anyone who wants to know how to do that. – Sayuri Mizuguchi Mar 10 '18 at 19:03
If my original file is available in some sort of storage, such as object storage, or is hosted on another website, then yes, I can put its URL in a field when ingesting it into the Discovery service. But if the file only exists locally, are you suggesting that I upload it to some storage first before pushing it? That might not be feasible for most people. – Mrutyunjaya Jena Mar 19 '18 at 04:20

score 1 · Answer 2 · answered Oct 18 '18 at 02:03

I also struggled with this request but ultimately got it working using the Python bindings for Watson Discovery. The online documentation and API reference are very poor; here's what I used to get it working:

(Assume you have a Watson Discovery service and have created a collection):

# Programmatic upload and retrieval of documents and metadata with Watson Discovery

from watson_developer_cloud import DiscoveryV1
import os
import json

discovery = DiscoveryV1(
    version='2017-11-07',
    iam_apikey='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    url='https://gateway-syd.watsonplatform.net/discovery/api'
)

environments = discovery.list_environments().get_result()
print(json.dumps(environments, indent=2))

This gives you your environment ID. Now append to your code:

collections = discovery.list_collections('{environment-id}').get_result()
print(json.dumps(collections, indent=2))

This shows the ID of the collection you will upload documents into programmatically. You should have a document to upload (in my case, an MS Word document) and its accompanying URL from your own source document system. I'll use a trivial fictitious example.

NOTE: the documentation does not tell you to pass 'rb' as the second argument to open(), but binary mode is required when uploading a Word document, as in my example below. Raw text and HTML documents can be uploaded without the 'rb' argument.

url = {"source_url":"http://mysite/dis030.docx"}
with open(os.path.join(os.getcwd(), '{path to your document folder with trailing / }', 'dis030.docx'), 'rb') as fileinfo:
    add_doc = discovery.add_document('{environment-id}', '{collection-id}', metadata=json.dumps(url), file=fileinfo).get_result()
    print(json.dumps(add_doc, indent=2))
    print(add_doc["document_id"])

Note that the metadata is set up as a Python dictionary and then encoded with json.dumps when passed as a parameter. So far I've only wanted to store the original source URL, but you could extend this with other fields as your own use case requires.
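To illustrate extending the metadata, here is a small sketch that builds the JSON string from the source URL plus any extra fields. The helper name build_metadata and the department field are hypothetical, chosen only for illustration.

```python
import json


def build_metadata(source_url, **extra):
    """Encode the source URL plus optional extra fields as a JSON string."""
    metadata = {"source_url": source_url}
    metadata.update(extra)  # extra keys are use-case specific
    return json.dumps(metadata)


metadata_json = build_metadata(
    "http://mysite/dis030.docx",
    department="safety",  # hypothetical extra field
)
print(metadata_json)
```

The returned string could be passed as the metadata argument of add_document in place of json.dumps(url).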

This call to Discovery gives you the document ID.

You can now query the collection and extract the metadata using something like a Discovery query:

my_query = discovery.query('{environment-id}', '{collection-id}', natural_language_query="chlorine safety")
print(json.dumps(my_query.result["results"][0]["metadata"], indent=2))

Note: I'm extracting just the stored metadata from the overall returned results. If you instead used print(my_query), you would get the full response from Discovery, but there is a lot to go through to find just your own custom metadata.
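Indexing results[0] fails when the query matches nothing, so a slightly more defensive way to pull out the URLs might look like the sketch below. It works on a plain dictionary shaped like the query result used above; the sample dictionary is hand-written for illustration and is not real Discovery output, and source_urls is a hypothetical helper name.

```python
def source_urls(response: dict) -> list:
    """Collect the source_url metadata value from every result that has one."""
    urls = []
    for result in response.get("results", []):
        url = result.get("metadata", {}).get("source_url")
        if url is not None:
            urls.append(url)
    return urls


# Hand-written stand-in for a query result; real responses carry more fields.
sample = {
    "results": [
        {"id": "a", "metadata": {"source_url": "http://mysite/dis030.docx"}},
        {"id": "b", "metadata": {}},
    ]
}
print(source_urls(sample))
```

With the query above, this could be called as source_urls(my_query.result).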
