
Is it possible to upload a document to a blob storage and do the following:

  1. Grab contents of document and add to index.
  2. Grab key phrases from contents in point 1 and add to index.

I want the key phrases then to be searchable.

I have code that uploads documents to blob storage, and it works perfectly, but the only way I know of to get this indexed is by using "Import Data" within the Azure Search service, which creates an index with predefined fields, as in the screenshot below:

(screenshot of the predefined index fields)

This works great when I only need those fields, and the index gets updated automatically every 5 minutes. But it becomes a problem when I want a custom index.

However, the only fields I DO want are the following:

  • fileId
  • fileText (the content of the document)
  • blobURL (to allow downloading of the document)
  • keyPhrases (to be pulled from fileText - I have code that does this as well)

The only issue is that I need to be able to retrieve the document content (fileText) to get the keyPhrases, but to my understanding I can only do this if the document content is already in an index for me to access.

I have very limited knowledge of Azure and am struggling to find anything similar to what I want to do.

The code that I am using to upload a document to my blob storage is as follows:

public CloudBlockBlob UploadBlob(HttpPostedFileBase file)
{
    string searchServiceName = ConfigurationManager.AppSettings["SearchServiceName"];
    string blobStorageKey = ConfigurationManager.AppSettings["BlobStorageKey"];
    string blobStorageName = ConfigurationManager.AppSettings["BlobStorageName"];
    string blobStorageURL = ConfigurationManager.AppSettings["BlobStorageURL"];
    string userId = User.Identity.GetUserId();
    // HH = 24-hour clock, so afternoon uploads don't collide with morning ones.
    string uploadDateTime = DateTime.Now.ToString("yyyyMMddHHmmss");

    try
    {
        var fileName = userId + "_" + uploadDateTime + "_" + file.FileName;
        var path = Path.Combine(Server.MapPath("~/App_Data/Uploads"), fileName);

        file.SaveAs(path);

        var credentials = new StorageCredentials(searchServiceName, blobStorageKey);

        var client = new CloudBlobClient(new Uri(blobStorageURL), credentials);

        // Retrieve a reference to the container. (Create it in the management
        // portal, or call container.CreateIfNotExists().)
        var container = client.GetContainerReference(blobStorageName);

        // Retrieve a reference to the blob, using the same unique name as the local file.
        var blockBlob = container.GetBlockBlobReference(fileName);

        // Create or overwrite the blob with the contents of the local file.
        using (var fileStream = System.IO.File.OpenRead(path))
        {
            blockBlob.UploadFromStream(fileStream);
        }

        System.IO.File.Delete(path);

        return blockBlob;
    }
    catch (Exception)
    {
        // TODO: log the exception rather than swallowing it.
        return null;
    }
}

I hope I haven't given too much information, but I don't know how else to explain what I am looking for. If I am not making sense, please let me know so that I can fix my question.

I am not looking for handout code, just looking for a shove in the right direction.

I would appreciate any help.

Thanks!

AxleWack

1 Answer


We can use Azure Search to index documents through the Azure Search REST API or the .NET SDK. Based on your description, I created a demo with the .NET SDK and tested it successfully. The following are my detailed steps:

  1. Create an Azure Search service in the Azure portal


  2. Get the search key from the Azure portal


  3. Create a custom index field model

    [SerializePropertyNamesAsCamelCase]
    public class TomTestModel
    {
        [Key]
        [IsFilterable]
        public string fileId { get; set; }
        [IsSearchable]
        public string fileText { get; set; }
        public string blobURL { get; set; }
        [IsSearchable]
        public string keyPhrases { get; set; }
    }

  4. Create the data source

    string searchServiceName = ConfigurationManager.AppSettings["SearchServiceName"];
    string adminApiKey = ConfigurationManager.AppSettings["SearchServiceAdminApiKey"];
    SearchServiceClient serviceClient = new SearchServiceClient(searchServiceName, new SearchCredentials(adminApiKey));

    var dataSource = DataSource.AzureBlobStorage("storage name", "connection string", "container name");
    // create the data source
    if (serviceClient.DataSources.Exists(dataSource.Name))
    {
        serviceClient.DataSources.Delete(dataSource.Name);
    }
    serviceClient.DataSources.Create(dataSource);
  5. Create the custom index

    var definition = new Index()
    {
        Name = "tomcustomindex",
        Fields = FieldBuilder.BuildForType<TomTestModel>()
    };
    // create the index
    if (serviceClient.Indexes.Exists(definition.Name))
    {
        serviceClient.Indexes.Delete(definition.Name);
    }
    var index = serviceClient.Indexes.Create(definition);


  6. Upload documents to the index. For more information about working with storage through the SDK, please refer to the documentation.

    CloudStorageAccount storageAccount = CloudStorageAccount.Parse("connection string");
    var blobClient = storageAccount.CreateCloudBlobClient();
    var container = blobClient.GetContainerReference("container name");
    var blobList = container.ListBlobs();

    var tomIndexList = blobList.Select(blob => new TomTestModel
    {
        fileId = Guid.NewGuid().ToString(),
        blobURL = blob.Uri.ToString(),
        fileText = "Blob Content",
        keyPhrases = "key phrases",
    }).ToList();
    var batch = IndexBatch.Upload(tomIndexList);
    // use the name of the index created above
    ISearchIndexClient indexClient = serviceClient.Indexes.GetClient("tomcustomindex");
    indexClient.Documents.Index(batch);
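In the snippet above, fileText and keyPhrases are placeholder strings. As a rough sketch of how they could be filled in (assuming the blobs contain plain text, and using a hypothetical `ExtractKeyPhrases` helper to stand in for your existing key-phrase code), you could download each blob's content while building the batch:

```csharp
// Sketch only: assumes the blobs contain plain text.
// ExtractKeyPhrases is a hypothetical stand-in for your own key-phrase code.
var tomIndexList = container.ListBlobs()
    .OfType<CloudBlockBlob>()
    .Select(blob =>
    {
        // Download the blob's content as a string (UTF-8 by default).
        string text = blob.DownloadText();
        return new TomTestModel
        {
            fileId = Guid.NewGuid().ToString(),
            blobURL = blob.Uri.ToString(),
            fileText = text,
            keyPhrases = ExtractKeyPhrases(text) // hypothetical helper
        };
    })
    .ToList();
```

Note that `DownloadText` only makes sense for blobs that actually are text; for binary formats such as Word or PDF you would need to extract the text separately, which is the likely cause of the `\u0000` characters mentioned in the comments below.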
    
  7. Check the search result in the Search explorer.


packages.config file:

<?xml version="1.0" encoding="utf-8"?>
<packages>
  <package id="Microsoft.Azure.KeyVault.Core" version="1.0.0" targetFramework="net452" />
  <package id="Microsoft.Azure.Search" version="3.0.0-rc" targetFramework="net452" />
  <package id="Microsoft.Data.Edm" version="5.6.4" targetFramework="net452" />
  <package id="Microsoft.Data.OData" version="5.6.4" targetFramework="net452" />
  <package id="Microsoft.Data.Services.Client" version="5.6.4" targetFramework="net452" />
  <package id="Microsoft.Rest.ClientRuntime" version="2.3.4" targetFramework="net452" />
  <package id="Microsoft.Rest.ClientRuntime.Azure" version="3.3.4" targetFramework="net452" />
  <package id="Microsoft.Spatial" version="6.15.0" targetFramework="net452" />
  <package id="Newtonsoft.Json" version="7.0.1" targetFramework="net452" />
  <package id="System.Spatial" version="5.6.4" targetFramework="net452" />
  <package id="WindowsAzure.Storage" version="7.2.1" targetFramework="net452" />
</packages>

TomTestModel file:

using System.ComponentModel.DataAnnotations;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

namespace TomAzureSearchTest
{
    [SerializePropertyNamesAsCamelCase]
    public class TomTestModel
    {
        [Key]
        [IsFilterable]
        public string fileId { get; set; }
        [IsSearchable]
        public string fileText { get; set; }
        public string blobURL { get; set; }
        [IsSearchable]
        public string keyPhrases { get; set; }
    }
}
Tom Sun - MSFT
  • Thanks Tom Sun. Looking at the code, I think I can see how these pieces fit together! Just one thing: in the section where you specify the fileId, blobURL, fileText etc., the one problem I have is accessing the actual text within the document being uploaded - your code suggests that I already have the document's content, which I don't, as I need to grab it from the document being uploaded. Is this possible? – AxleWack Dec 05 '16 at 08:41
  • For more information about working with storage through the SDK, please refer to the [documentation](https://learn.microsoft.com/en-us/azure/storage/storage-dotnet-how-to-use-blobs#download-blobs). – Tom Sun - MSFT Dec 05 '16 at 08:44
  • Great! Thanks! I will work through your code and details provided, as well as the link you provided and will get back to you. Thanks! – AxleWack Dec 05 '16 at 08:47
  • The [IsFilterable] and [IsSearchable] attributes are giving an error. I got [SerializePropertyNamesAsCamelCase] working by referencing Microsoft.Azure.Search.Models, but I don't know what to reference for [IsFilterable] and [IsSearchable]. The suggestion wants to change them to [Filterable], which is part of System.Web.UI. – AxleWack Dec 05 '16 at 09:00
  • I updated the answer and added the TomTestModel file. Please have a try. – Tom Sun - MSFT Dec 05 '16 at 09:04
  • I did have them; it seems the issue was that I needed to update Microsoft.Azure.Search to the latest version from the NuGet manager. Thanks. – AxleWack Dec 05 '16 at 09:08
  • We can restore the packages from the packages.config file. For more package information, please refer to the packages.config file that I mentioned. – Tom Sun - MSFT Dec 05 '16 at 09:10
  • Perfect!!! By using your code and adding the code I have, plus a few changes, I got it working! Thank you!! Just one thing I noticed that doesn't seem right: when I use DownloadText() to get the text from the document just uploaded, I get A LOT of jargon (i.e. \u0000\u0000����\) - and I mean A LOT of it! Is there no way that, when downloading the uploaded blob's text, it does not include things like this? I will mark your answer as the answer either way! Thanks for all the help!! – AxleWack Dec 05 '16 at 11:10
  • Please try `DownloadText(Encoding.UTF8)` or the approach mentioned in another [SO thread](http://stackoverflow.com/questions/11231147/cloudblob-downloadtext-method-inserts-additional-character?answertab=oldest#tab-top). – Tom Sun - MSFT Dec 05 '16 at 13:26
  • Thanks. Unfortunately that didn't work. I will do some research and find a solution. Thanks for all the help! It's led me in the right direction :) – AxleWack Dec 05 '16 at 17:14