
We have the following requirement to ingest data into an Excel file:

  • Query data from a Parquet file stored in Azure Data Lake Storage Gen2
  • Ingest the data into an Excel file
  • Store the Excel file in SharePoint Online

What would be the optimal approach to implement this? Would it be better to implement server-side CSOM or to use other Azure tools?

Your help is much appreciated.

Faizal

1 Answer


I would probably approach this like so (a few components, but the least amount of custom code):

Flow

  1. Use a Logic App to copy the Parquet file to Azure Data Lake (or Blob) Storage and trigger a Data Factory pipeline
  2. The Data Factory pipeline executes a Databricks Python or Scala notebook (both can transform the data and save it to Excel as the output)
  3. Use a Logic App to transfer the Excel file back to SharePoint

Because

  • Logic Apps are amazing for SharePoint operations and task orchestration.
  • Data Factory nicely orchestrates Databricks notebook execution.
  • Databricks can easily read Parquet files and just as easily save data as Excel files, so it seems like the perfect choice.

While there are three components, in the best-case scenario it will take literally under 10 lines of code to do what you need. The rest is just 'clicking'.
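For step 2, here is a minimal sketch of what the Databricks notebook could look like in Python. The storage account, container, paths, and mount point are placeholders you would replace with your own, the cluster is assumed to already have access to the lake (credential passthrough, service principal, or mount), and pandas' `to_excel` needs `openpyxl` installed on the cluster:

```python
import pandas as pd

# `spark` is provided by the Databricks notebook runtime.
# Read the Parquet file from ADLS Gen2 (abfss path is a placeholder).
df = spark.read.parquet("abfss://data@mylake.dfs.core.windows.net/input/sales.parquet")

# Convert to pandas and write an Excel file to a mounted output location.
# Fine for small/medium data; Excel is not meant for very large datasets.
df.toPandas().to_excel("/dbfs/mnt/output/sales.xlsx", index=False)
```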

You could potentially skip Data Factory and call Databricks via its REST API from Logic Apps, but why bother when there are out-of-the-box components to do that.
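If you did go that route, the Logic App's HTTP action would essentially make the Databricks Jobs `run-now` call. A sketch of the equivalent request in Python (the workspace URL, job ID, and token are placeholders):

```python
import requests

# Trigger an existing Databricks job; the Logic App HTTP action would send the same request.
resp = requests.post(
    "https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/jobs/run-now",
    headers={"Authorization": "Bearer <databricks-personal-access-token>"},
    json={"job_id": 42},
)
resp.raise_for_status()
print(resp.json()["run_id"])  # poll the runs API with this id to wait for completion
```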

Adam Marczak
  • Thanks Adam. Can the first step of copying the Parquet file to Azure Blob Storage be avoided if ADF can directly connect to ADLS Gen2? – Faizal Sep 24 '20 at 20:25
  • 1
    ADLSg2 is a Blob storage with Hierarchical Namespaces enabled. So you can use logic apps to put file directly there too. Updated the answer for clarity. – Adam Marczak Sep 25 '20 at 06:59
  • Adam, for the Databricks Python notebook 'save to Excel' step I am able to find sample code using a pandas data frame, but it is not recommended for larger files. Any other suggestions? Thanks – Faizal Oct 07 '20 at 22:05
  • In general, Excel files are not designed for large data. Most libraries do not scale well, but for larger files, from what I have heard, pandas is a better option than, for example, the spark-excel library. – Adam Marczak Oct 08 '20 at 09:24
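As a follow-up to that comment thread, a plain pandas sketch of the Parquet-to-Excel conversion (file names and the engine choice are placeholders; `read_parquet` needs `pyarrow` or `fastparquet`, and `to_excel` needs `openpyxl` or `xlsxwriter`). Keep in mind an Excel worksheet caps out at 1,048,576 rows, so this only makes sense for data that fits within that limit:

```python
import pandas as pd

df = pd.read_parquet("sales.parquet")
df.to_excel("sales.xlsx", index=False, engine="xlsxwriter")
```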