5

I am currently working on a project where we have data stored in Azure Data Lake. The Data Lake is connected to Azure Databricks.

The requirement is to connect Azure Databricks to a C# application so that we can run queries and get the results, all from the C# application. The way we are currently tackling the problem is that we have created a workspace on Databricks with a number of queries that need to be executed. We created a job that is linked to the mentioned workspace. From the C# application we are calling a number of the APIs listed in this documentation to trigger a run of the job and wait for it to be executed. However, I have not been able to extract the result from any of the APIs listed in the documentation.
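For reference, this is roughly how we trigger a run from the C# side, calling the Jobs API 2.0 run-now endpoint over HTTP (the workspace URL, personal access token, and job ID below are placeholders):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class DatabricksJobTrigger
{
    static async Task Main()
    {
        // Placeholder workspace URL and personal access token.
        var client = new HttpClient { BaseAddress = new Uri("https://adb-1234567890123456.7.azuredatabricks.net") };
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", "<personal-access-token>");

        // Trigger a run of an existing job via the Jobs API 2.0 run-now endpoint.
        var body = new StringContent("{\"job_id\": 123}", Encoding.UTF8, "application/json");
        var response = await client.PostAsync("/api/2.0/jobs/run-now", body);
        response.EnsureSuccessStatusCode();

        // The response contains the run_id that we then poll for completion.
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}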

My question is this: are we taking the correct approach, or is there something we are not seeing? If this is the way to go, what has been your experience in extracting the result of a successfully run job on Azure Databricks from a C# application?

Ryan Falzon

2 Answers

4

Microsoft has a nice reference architecture solution that might help you get some more insights too.

I'm not sure using the REST API is the best way to go to get your job output from Azure Databricks.

First of all, the REST API has a rate limit per Databricks instance. It's not that bad at 30 requests per second, but whether that is sufficient strongly depends on the scale of your application and on other uses of the Databricks instance. It should be enough for creating a job, but if you want to poll the job status for completion it might not be.

There is also limited capacity for data transfer via the REST API. For example, as per the docs, the output API will only return the first 5 MB of a run's output. If you want larger results, you'll have to store them somewhere else before getting them from your C# application.
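If you do stick with the REST API, a minimal polling sketch would look something like the following (workspace URL, token, and run ID are placeholders); the get-output call at the end is the one that is capped at 5 MB:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

class RunPoller
{
    static async Task Main()
    {
        var client = new HttpClient { BaseAddress = new Uri("https://adb-1234567890123456.7.azuredatabricks.net") };
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", "<personal-access-token>");

        long runId = 456; // returned by the run-now call

        // Poll the run state until it reaches a terminal life cycle state.
        // Keep the interval generous so you stay well under the rate limit.
        string lifeCycleState;
        do
        {
            await Task.Delay(TimeSpan.FromSeconds(10));
            using var run = JsonDocument.Parse(
                await client.GetStringAsync($"/api/2.0/jobs/runs/get?run_id={runId}"));
            lifeCycleState = run.RootElement.GetProperty("state")
                                .GetProperty("life_cycle_state").GetString();
        } while (lifeCycleState != "TERMINATED" && lifeCycleState != "SKIPPED"
                 && lifeCycleState != "INTERNAL_ERROR");

        // For a notebook task this returns whatever was passed to
        // dbutils.notebook.exit(), truncated to the first 5 MB.
        var output = await client.GetStringAsync($"/api/2.0/jobs/runs/get-output?run_id={runId}");
        Console.WriteLine(output);
    }
}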

Alternative retrieval method

In short: use Azure PaaS to your advantage with Blob Storage and Event Grid.

This is in no way an exhaustive solution, and I'm sure someone can come up with a better one; however, this has worked for me in similar use cases.

What you can do is write the result of your job runs to some form of cloud storage connected to Databricks and then get the result from that storage location later. There is a step in this tutorial that shows the basic concept of storing the results of a job in SQL Data Warehouse, but you can use any storage you like, for example Blob Storage.

Let's say you store the result in Blob Storage. Each time a new job output is written to a blob, an event is raised. You can subscribe to these events via Azure Event Grid and consume them in your application. There is a .NET SDK that will let you do this. The event will contain a blob URI that you can use to get the data into your application.

From the docs, a BlobCreated event will look something like this:

[{
  "topic": "/subscriptions/{subscription-id}/resourceGroups/Storage/providers/Microsoft.Storage/storageAccounts/my-storage-account",
  "subject": "/blobServices/default/containers/test-container/blobs/new-file.txt",
  "eventType": "Microsoft.Storage.BlobCreated",
  "eventTime": "2017-06-26T18:41:00.9584103Z",
  "id": "831e1650-001e-001b-66ab-eeb76e069631",
  "data": {
    "api": "PutBlockList",
    "clientRequestId": "6d79dbfb-0e37-4fc4-981f-442c9ca65760",
    "requestId": "831e1650-001e-001b-66ab-eeb76e000000",
    "eTag": "\"0x8D4BCC2E4835CD0\"",
    "contentType": "text/plain",
    "contentLength": 524288,
    "blobType": "BlockBlob",
    "url": "https://my-storage-account.blob.core.windows.net/testcontainer/new-file.txt",
    "sequencer": "00000000000004420000000000028963",
    "storageDiagnostics": {
      "batchId": "b68529f3-68cd-4744-baa4-3c0498ec19f0"
    }
  },
  "dataVersion": "",
  "metadataVersion": "1"
}]
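As a rough sketch, assuming the Azure.Messaging.EventGrid package (adjust for the SDK version you actually use), your webhook or Azure Function handler could pull the blob URL out of such an event like this:

using System;
using Azure.Messaging.EventGrid;
using Azure.Messaging.EventGrid.SystemEvents;

class BlobCreatedHandler
{
    // 'payload' is the JSON body that Event Grid delivers to your endpoint.
    public static void Handle(string payload)
    {
        foreach (EventGridEvent egEvent in EventGridEvent.ParseMany(BinaryData.FromString(payload)))
        {
            if (egEvent.EventType == "Microsoft.Storage.BlobCreated" &&
                egEvent.TryGetSystemEventData(out object systemEvent) &&
                systemEvent is StorageBlobCreatedEventData blobCreated)
            {
                // e.g. https://my-storage-account.blob.core.windows.net/testcontainer/new-file.txt
                Console.WriteLine($"New job output blob: {blobCreated.Url}");
            }
        }
    }
}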

It will be important to name your blobs with the required information, such as job ID and run ID. You can also create custom events, which will increase the complexity of the solution but will allow you to add more details to your events.

Once you have the blob created event data in your app, you can use the storage SDK to get the blob data for use in your application. Depending on your application logic, you'll also have to manage the job IDs and run IDs in the application, otherwise you run the risk of having job output in your storage that is no longer attached to a process in your app.
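For example, with the Azure.Storage.Blobs package, reading the output back could look roughly like this (here the blob URI is assumed to carry a SAS token; otherwise pass a credential appropriate to how the storage account is secured):

using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class JobOutputReader
{
    public static async Task<string> ReadOutputAsync(Uri blobUri)
    {
        // blobUri comes from the BlobCreated event; assumed to include a SAS
        // token here, otherwise pass a credential to the client constructor.
        var blobClient = new BlobClient(blobUri);
        var download = await blobClient.DownloadContentAsync();
        return download.Value.Content.ToString();
    }
}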

AimusSage
  • Hi, thanks for your reply. So from what I understand, Azure Databricks was not created with the intent to act as a data store for a web application, for example, am I right? So let's consider I have all this data stored in a Data Lake which is connected to Databricks; from your response, I think what would be best is to forget about running jobs off Databricks, set up some sort of data store, and have all the data in Databricks exported to that data store, right? – Ryan Falzon Mar 25 '20 at 07:16
  • Yes. Databricks is a really cool analytics platform, not a data store. The nice thing about Microsoft is that they provide reference architecture samples for a lot of the Azure services. [This one](https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/ai/batch-scoring-databricks) and [this one](https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/ai/real-time-recommendation) both give some nice insights. – AimusSage Mar 26 '20 at 11:40
  • Thanks for the insight you provided! – Ryan Falzon Mar 27 '20 at 13:53
1

Your use case is to use Databricks as a compute engine (something similar to MySQL) and get the output into a C# application. So the best way is to create tables in Databricks and run those queries via an ODBC connection: https://learn.microsoft.com/en-us/azure/databricks/integrations/bi/jdbc-odbc-bi. This way you have more control over the SQL query output.
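A rough sketch of what that looks like from C#, assuming the Spark ODBC driver is installed and configured as a DSN named "Databricks" (the table name is a placeholder):

using System;
using System.Data.Odbc;

class DatabricksOdbcQuery
{
    static void Main()
    {
        // The DSN (or a full driver connection string) points at the cluster's
        // ODBC endpoint and carries the personal access token.
        using var connection = new OdbcConnection("DSN=Databricks");
        connection.Open();

        using var command = new OdbcCommand("SELECT * FROM my_database.my_table LIMIT 10", connection);
        using var reader = command.ExecuteReader();
        while (reader.Read())
        {
            Console.WriteLine(reader[0]);
        }
    }
}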

ashitabh
  • The limitation with this approach is that you cannot host your C# application in Azure App Service, because you need to have the driver installed on the machine where you are running the application. Any solution on that front? – Ganesh Apr 15 '21 at 05:45
  • That is the bad part of Azure. They support the SQL Server driver but not others. So the option is to use a Windows container, or simply migrate your code to Java or Python and use the JDBC jar driver in an Azure Function. There is very limited support for C# for Spark. – ashitabh Apr 16 '21 at 08:23