I am writing a program that takes advantage of IBM Watson's Document Conversion service to convert documents of various types into answer units. Each answer unit that is returned by the service contains an array named content which is composed of objects having a media_type and a text element.

I've never seen more than one element in this content array, and I'm not sure how I would handle several if they appeared. Can there ever be more than one element in this array and, if so, what are the possible values? Will they all have the same media_type value? My plan at the moment is to combine all of the text elements into one if more than one exists.

David Powell

2 Answers

The answer unit content array can have more than one element (if you request that - see below). If it does, each element in the array will be a different media type representation of the same contents.

You can get this by putting more than one output media type in your request. When you do this, the output content array will contain more than one element, with one element for each of the media types you request.

For example, if your request contained a config like this:

{
    conversion_target : 'answer_units',
    answer_units : {
        output_media_types : ['text/plain', 'text/html']
    }
}

(See https://www.ibm.com/watson/developercloud/document-conversion/api/v1/#convert-document for an explanation of where to put the config.)

Then the content in your response will contain:

content : [
    {
        text : <the plain text contents of the answer unit>,
        ...
    },
    {
        text : <the HTML contents of the answer unit>,
        ...
    }
]

If you don't specify the output media type parameter, you'll get the default value which is:

        output_media_types : ['text/plain']

This is why you're always getting an array of length 1, with a text version of the output: by leaving the default config in place, you're implicitly asking for one output media type.
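
Since each element is a different representation of the same contents, concatenating the text elements would duplicate the answer unit rather than extend it. A minimal sketch of picking one representation instead, assuming each content element carries the media_type and text fields described in the question (the pick_text helper and the sample answer unit are hypothetical, not part of the service or any SDK):

```python
from typing import Optional


def pick_text(answer_unit: dict, preferred: str = "text/plain") -> Optional[str]:
    """Return the text of the content element whose media_type matches `preferred`.

    Falls back to the first element if no element matches, and returns None
    if the content array is missing or empty.
    """
    content = answer_unit.get("content", [])
    for element in content:
        if element.get("media_type") == preferred:
            return element.get("text")
    return content[0].get("text") if content else None


# Hypothetical answer unit, shaped like the response sketched above.
unit = {
    "content": [
        {"media_type": "text/plain", "text": "Hello world"},
        {"media_type": "text/html", "text": "<p>Hello world</p>"},
    ]
}

print(pick_text(unit))               # the plain text representation
print(pick_text(unit, "text/html"))  # the HTML representation
```

With the default config the array has a single text/plain element, so the same helper works unchanged.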

dalelane
  • This is good information. Since I am not specifying an output_media_type, will there be only one element then? I am converting all file types that Document Conversion accepts. – David Powell Sep 09 '16 at 18:08
  • sorry, I should've included the default behaviour if you don't include the option - I've updated my answer to include that now. – dalelane Sep 10 '16 at 14:59
  • Just want to point out though -- the output_media_types option is undocumented because it's not currently a supported feature (i.e. it may disappear in the future). Also, the HTML that you get back in those snippets may be fragmented, with unmatched tags. – Matt F Sep 12 '16 at 14:57
  • That's a good point - hadn't spotted that. I think I first learned about it from https://developer.ibm.com/answers/answers/244613/view.html instead of the API doc which probably should've been a clue! – dalelane Sep 12 '16 at 15:28

The Answer Units converter currently only splits by heading tags (<h1> and <h2> by default). If you want to split your answer units more granularly, you can change the level at which it splits by passing in a custom configuration:

{
    "answer_units": {
        "selector_tags": ["h1","h2","h3","h4","h5","h6"]
    }
}

See https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau
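
If you build that configuration in code, a rough sketch might look like the following. It only produces the JSON shown above; the build_selector_config helper is a hypothetical name, and how the resulting string is sent to the service is not covered here.

```python
import json


def build_selector_config(max_level: int = 6) -> str:
    """Build a JSON config that splits answer units on h1..h<max_level>."""
    if not 1 <= max_level <= 6:
        raise ValueError("HTML heading levels run from 1 to 6")
    tags = [f"h{level}" for level in range(1, max_level + 1)]
    return json.dumps({"answer_units": {"selector_tags": tags}}, indent=4)


# Split on every heading level, as in the configuration above.
print(build_selector_config(6))
```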

Matt F
  • That gets you more than one element in the output `answer_units` array, doesn't it? (rather than in the output `content` array the OP asked about). Or am I misunderstanding something? – dalelane Sep 09 '16 at 13:47
  • You are absolutely correct; I didn't read the question closely enough. – Matt F Sep 12 '16 at 14:49