Extract hyperlinks along with related text using Azure Cognitive Services

Question

We have a requirement to search documents and return the related text along with the hyperlinks present in those documents. Using Azure Search we are able to get the text, but not the hyperlinks associated with that text.

Based on the example below, is there a way to get the hyperlink (https://stackoverflow.com) associated with the text when using Azure Cognitive Services?

e.g.
This is a text in the document which we have indexed using azure search.

Output from azure search:
This is a text in the document which we have indexed using azure search.

I looked at the Text Analytics API, but I have not found anything related to extracting hyperlinks along with the text.

If the hyperlink is inside the Word document and you need to extract the URL as a key-value pair, you may need to pre-process the document and extract the URL using Azure Form Recognizer. Please check out this document for the steps: Form Recognizer general document model (preview). If the key-value pair is not part of the metadata in a Word document, the search service wouldn't map it that way. See this discussion [thread](https://learn.microsoft.com/en-us/answers/questions/953464/azure-search-word-document-hyperlinks), which I answered. – AjayKumar, May 24 '23 at 18:57

score 1 · Answer 1 · answered May 08 '23 at 13:23

Reading between the lines of your question, I'm assuming you're trying to index HTML documents with an Azure Search indexer, and the indexer is extracting only the human-readable text from the HTML?

You can control what data is extracted from your blobs by changing the "parsingMode" configuration on the indexer. The default value, "default", strips out all of the HTML markup. If you change the value to "text", you can index the full HTML (including markup such as the href attributes of anchor elements).
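As a rough illustration of that setting, here is a minimal Python sketch that builds an indexer definition with "parsingMode" set to "text". The indexer, data source, and index names are hypothetical placeholders, and the helper function is only for illustration; the resulting JSON would be sent as the body of a create-or-update indexer request, which is not shown here.

```python
import json


def build_indexer_definition(name, data_source, target_index, parsing_mode="text"):
    """Return an indexer definition body with the given parsingMode.

    All names passed in are placeholders (assumptions), not values from the question.
    """
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": target_index,
        "parameters": {
            # "text" keeps the raw file content, including HTML markup
            "configuration": {"parsingMode": parsing_mode}
        },
    }


# Hypothetical names, used only to show the shape of the definition
definition = build_indexer_definition("html-indexer", "html-blobs", "html-index")
print(json.dumps(definition, indent=2))
```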

I do not believe there's any way to configure an Azure Search indexer to strip all of the HTML markup except the hyperlinks, though. If your scenario requires more complicated parsing like that, you'll need to do it yourself, perhaps via a custom skill if you still want to use the rest of the indexer pipeline.
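For the "do it yourself" route, one possible sketch of the parsing step uses only Python's standard library to pull each anchor's text together with its href out of raw HTML. The class name, function name, and sample markup are assumptions for illustration; wiring this into a custom skill or a pre-processing job is left out.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (anchor text, href) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []     # list of (anchor text, href) tuples
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside the current anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (anchor text, href) pairs found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup
sample = '<p>Hi</p><a href="https://www.google.com/">link here</a><p>hello there</p>'
print(extract_links(sample))
```

The extracted pairs could then be stored in a separate index field next to the plain text, so a search hit can return both.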

Austin

Hi @Austin, I tried this approach with a simple HTML file, but I still can't see any links in the result. I changed the indexer's parsing mode from the default to text. In file:
Hi
link here
hello there this is a sample file that has link and the text passes on
From search: "content": "Hi\n\nlink here\nhello there this is a sample file that has link and the text passes on\n\n", – mukundha reddy May 09 '23 at 07:02
But when the same file is converted to PDF, we are able to get the links, although they appear at the end of the content, as shown below. "content": "\nHi\nlink here\n\nhello ther, e this is a sample file that has link and the text passes on\n\nhttp://www.bing.com/\nhttps://www.google.com/\n\n", – mukundha reddy May 09 '23 at 07:17
