Questions tagged [azure-data-lake]

Azure Data Lake is a suite of three big data services in Microsoft Azure: HDInsight, Data Lake Store, and Data Lake Analytics. These fully managed services make it easy to get started and easy to scale big data jobs written in U-SQL, Apache Hive, Pig, Spark, and Storm.

  • HDInsight is a fully managed, monitored, and supported Apache Hadoop service, bringing the power of Hadoop clusters to you with a few clicks.
  • Data Lake Store is a cloud-scale service designed to store all data for analytics. The Data Lake Store allows for petabyte-sized files and unlimited account sizes, surfaced through an HDFS API that enables any Hadoop component to access data. Additionally, data in Data Lake Store is protected via ACLs that can be tied to an OAuth2-based identity, including identities from your on-premises Active Directory.
  • Data Lake Analytics is a distributed service built on Apache YARN that dynamically scales on demand while you pay only for the job that is running. Data Lake Analytics also includes U-SQL, a language designed for big data that keeps the familiar declarative syntax of SQL and is easily extended with user code authored in C#.

To learn more, check out: https://azure.microsoft.com/en-us/solutions/data-lake/

1870 questions
4
votes
2 answers

How to copy all files and folders in a specific directory using Azure Data Factory

I have a folder in ADLS Gen2, call it mysource1, which has hundreds of subfolders, and each subfolder again contains folders and many files. How can I copy all of the folders and files in mysource1 using Azure Data Factory?
maddy
  • 41
  • 1
  • 1
  • 2
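
One common answer to the question above is an ADF Copy activity over binary datasets with mysource1 as the source folder and recursion enabled, which copies the whole tree without enumerating it yourself. As an illustration outside ADF, here is a minimal Python sketch of the same recursive walk with the azure-storage-file-datalake SDK; the account URL, credential, and filesystem names are placeholders:

    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net",
        credential="<account-key>",  # placeholder
    )
    src_fs = service.get_file_system_client("source-filesystem")
    dst_fs = service.get_file_system_client("dest-filesystem")

    # recursive=True walks every subfolder under mysource1, however deeply nested
    for item in src_fs.get_paths(path="mysource1", recursive=True):
        if item.is_directory:
            dst_fs.create_directory(item.name)  # recreate the folder structure
        else:
            data = src_fs.get_file_client(item.name).download_file().readall()
            dst_fs.create_file(item.name).upload_data(data, overwrite=True)

Note that readall() buffers each file in memory, which is fine for a sketch but worth replacing with chunked streaming for large files.
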
4
votes
1 answer

Unzip in Azure Data Factory

I have a zip file with a size of 32 GB. I am required to import it into a Data Lake Storage account. I am trying to unzip and move the file through Azure Data Factory. The zip file is uploaded to Azure Blob Storage. However, I cannot see the…
Harsha W
  • 3,162
  • 5
  • 43
  • 77
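
ADF can decompress zip archives natively: setting the source dataset's compression type to ZipDeflate makes the Copy activity extract while copying. If the extraction has to happen in code instead, here is a hedged Python sketch (connection strings, container, and paths below are placeholders) that stages the archive to local disk first, since a 32 GB zip will not fit in memory:

    import zipfile

    from azure.storage.blob import BlobClient
    from azure.storage.filedatalake import DataLakeServiceClient

    # Stage the archive locally; ZipFile needs a seekable file.
    blob = BlobClient.from_connection_string("<blob-conn-str>", "mycontainer", "big.zip")
    with open("/tmp/big.zip", "wb") as f:
        blob.download_blob().readinto(f)

    lake = DataLakeServiceClient.from_connection_string("<adls-conn-str>")
    fs = lake.get_file_system_client("myfilesystem")

    with zipfile.ZipFile("/tmp/big.zip") as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries; ADLS Gen2 creates parents as needed
            with archive.open(member) as src:
                fs.create_file(f"unzipped/{member}").upload_data(src, overwrite=True)
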
4
votes
1 answer

LeaseAlreadyPresent Error in Azure Data Factory V2

I am getting the following error in a pipeline that has a Copy activity with a REST API as source and Azure Data Lake Storage Gen2 as sink. "message": "Failure happened on 'Sink' side.…
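
This error generally means another writer still holds a lease on the sink path, often a previous run that died mid-copy. The durable fix is to stop two writers from targeting the same file; as a one-off cleanup, here is a hedged sketch of breaking a stale lease with the azure-storage-blob SDK (connection string, container, and blob name are placeholders):

    from azure.storage.blob import BlobClient, BlobLeaseClient

    blob = BlobClient.from_connection_string("<conn-str>", "mycontainer", "sink-file.json")

    # break_lease frees a stuck lease so the next pipeline run can acquire the file
    BlobLeaseClient(client=blob).break_lease(lease_break_period=0)
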
4
votes
1 answer

Azure Data Lake - read using Python

I am trying to read a file from Azure Data Lake using Python in a Databricks notebook. This is the code I used: from azure.storage.filedatalake import DataLakeFileClient file =…
2713
  • 185
  • 1
  • 10
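
For reference, here is a minimal working sketch of the read the question describes, using DataLakeFileClient from azure-storage-file-datalake; the account URL, filesystem, path, and key are placeholders (in Databricks, mounting the lake and using spark.read is a common alternative):

    from azure.storage.filedatalake import DataLakeFileClient

    file = DataLakeFileClient(
        account_url="https://<account>.dfs.core.windows.net",
        file_system_name="myfilesystem",
        file_path="folder/sample.csv",
        credential="<account-key>",  # an azure-identity credential also works
    )

    content = file.download_file().readall()  # bytes of the whole file
    print(content[:100])
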
4
votes
1 answer

Spark.read() multiple paths at once instead of one-by-one in a for loop

I am running the following code: list_of_paths is a list of paths, each ending in an Avro file. For example, ['folder_1/folder_2/0/2020/05/15/10/41/08.avro', 'folder_1/folder_2/0/2020/05/15/11/41/08.avro',…
NikSp
  • 1,262
  • 2
  • 19
  • 42
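
DataFrameReader.load accepts a list of paths, so Spark can plan a single job over all the files instead of looping. A minimal sketch using the paths from the question (the avro format is built into the Databricks runtime; open-source Spark needs the spark-avro package):

    list_of_paths = [
        "folder_1/folder_2/0/2020/05/15/10/41/08.avro",
        "folder_1/folder_2/0/2020/05/15/11/41/08.avro",
    ]

    # One read, one job: no per-path loop required
    df = spark.read.format("avro").load(list_of_paths)
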
4
votes
1 answer

Solution for business users to upload Data Lake ETL inputs

Question: I think this is a pretty common issue; hopefully there are solutions/approaches we can reuse. We're building a data lake in Azure ADLS Gen2 with a unidirectional data flow: Nifi/ADF -> ADLS -> ETL/Spark/Databricks -> Data Warehouse -> Power BI. Some…
VB_
  • 45,112
  • 42
  • 145
  • 293
4
votes
1 answer

Difference between Azure Blob Storage and Azure Data Lake Storage

There seems to be confusion for users like me as to what the main differences between Azure Blob Storage and Azure Data Lake Storage are, and in which use cases Azure Blob Storage fits better than Azure Data Lake Storage, and vice versa. Thank you.
user13128577
4
votes
2 answers

Get the latest added file in a folder [Azure Data Factory]

Inside the data lake, we have a folder that contains the files pushed by an external source every day. However, we want to process only the latest file added to that folder. Is there any way to achieve that with Azure Data Factory?
OreoFanatics
  • 818
  • 4
  • 15
  • 32
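
Within ADF itself, the usual pattern is a Get Metadata activity returning childItems plus a filter on lastModified (the Copy activity also offers a "filter by last modified" source setting). Outside ADF, here is a hedged Python sketch of the same idea with the azure-storage-file-datalake SDK; the connection string, filesystem, and folder are placeholders:

    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient.from_connection_string("<conn-str>")
    fs = service.get_file_system_client("myfilesystem")

    # get_paths exposes last_modified, so the newest file is a simple max()
    files = [p for p in fs.get_paths(path="incoming", recursive=False)
             if not p.is_directory]
    latest = max(files, key=lambda p: p.last_modified)
    print(latest.name, latest.last_modified)
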
4
votes
2 answers

Writing DataFrame to Parquet or Delta Does not Seem to be Parallelized - Taking Too Long

Problem statement: I've read a partitioned CSV file into a Spark DataFrame. In order to leverage the improvements of Delta tables, I'm trying to simply export it as Delta into a directory inside an Azure Data Lake Storage Gen2 account. I'm using the code below…
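
A write's parallelism follows the DataFrame's partition count, so a single-partition DataFrame produces a single long-running write task. A minimal sketch of the usual check and fix (the paths and partition count are placeholders, not the asker's code):

    df = spark.read.option("header", "true").csv(
        "abfss://container@account.dfs.core.windows.net/raw/")

    print(df.rdd.getNumPartitions())  # if this prints 1, the write runs on one task

    (df.repartition(64)               # spread the write across 64 tasks
       .write.format("delta")
       .mode("overwrite")
       .save("abfss://container@account.dfs.core.windows.net/delta/table1"))
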
4
votes
2 answers

Rename written CSV file Spark throws Error "Path must be absolute" - Azure Data Lake

I tried the solution described in Rename written CSV file Spark, but I am getting the following error: "java.lang.IllegalArgumentException: Path must be absolute". How can I fix it? The answer can be in Scala or Python. Thanks :) import…
user12525899
  • 133
  • 1
  • 10
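
The "Path must be absolute" error typically goes away when the rename uses fully qualified URIs. A hedged PySpark sketch of the usual fix via the Hadoop FileSystem API (the abfss:// paths are placeholders; spark._jvm and spark._jsc are internal but widely used for this):

    # Build fully qualified (absolute) paths so the filesystem can resolve them
    hadoop = spark._jvm.org.apache.hadoop.fs
    conf = spark._jsc.hadoopConfiguration()

    src = hadoop.Path("abfss://container@account.dfs.core.windows.net/out/part-00000.csv")
    dst = hadoop.Path("abfss://container@account.dfs.core.windows.net/out/report.csv")

    fs = src.getFileSystem(conf)  # resolves the ADLS filesystem from the URI
    fs.rename(src, dst)
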
4
votes
1 answer

Not able to see 'Lifecycle management' option for ADLS Gen2

I have created an ADLS (Azure Data Lake Storage) Gen2 resource (StorageV2 with hierarchical namespace enabled). The region I created the resource in is Central US, the performance/access tier is Standard/Hot, and replication is LRS. But for this…
4
votes
1 answer

U-SQL files vs. managed tables - how is data stored physically?

I am quite new to ADL and U-SQL. I went through quite a lot of documentation and presentations, but I am afraid I am still lacking many answers. Simplifying a bit, I have a large set of data (with daily increments), but it contains information…
Jakub Krupa
  • 127
  • 9
4
votes
1 answer

How to fix inconsistent schemas in Parquet file partitions using Spark

I am new to Spark and I ran into a problem when appending new data to a partition. My pipeline ingests daily CSVs into Azure Data Lake (basically HDFS) using Databricks. I also run some simple transformations on the data and remove duplicates, etc.…
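
When partitions disagree only by added or missing columns, Spark can reconcile them at read time with mergeSchema; incompatible type changes in the same column still need a rewrite. A minimal sketch (the path is a placeholder):

    df = (spark.read
          .option("mergeSchema", "true")  # union the schemas of all part files
          .parquet("abfss://container@account.dfs.core.windows.net/data/table1"))

    df.printSchema()  # columns from every partition, nulls where absent
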
4
votes
1 answer

Azure Databricks to Event Hub

I am very new to Databricks, so pardon me please. Here is my requirement: I have data stored in Azure Data Lake. As per the requirement, we can only access the data via an Azure Databricks notebook. We have to pull the data from certain tables, join with…
Hillol Saha
  • 123
  • 1
  • 12
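
One way to bridge the two from a notebook is the azure-eventhub Python SDK: read and join in Spark, then publish the result as events. A hedged sketch, assuming a small result set (collect() pulls rows to the driver, and a single batch has a size cap, so chunk the sends for larger volumes); the connection string, hub name, and table path are placeholders:

    from azure.eventhub import EventData, EventHubProducerClient

    producer = EventHubProducerClient.from_connection_string(
        "<event-hubs-conn-str>", eventhub_name="myhub")

    # Joined result, serialized one JSON document per row
    rows = (spark.read.format("delta").load("/mnt/datalake/joined_table")
            .toJSON().collect())

    with producer:
        batch = producer.create_batch()
        for row in rows:
            batch.add(EventData(row))  # one event per record
        producer.send_batch(batch)
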
4
votes
4 answers

Azure MSI with AdlsClient: Access token expired

I am using Azure Managed Service Identity (MSI) to create a static (singleton) AdlsClient. I then use the AdlsClient in a Functions app to write to a Data Lake store. The app works fine for about a day, but then it stops working and I see this…
MV23
  • 285
  • 5
  • 17
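
The root cause in this pattern is caching a raw token inside a singleton client: MSI tokens expire, so the static AdlsClient eventually carries a dead token. The fix is to cache a credential (or a client built on one) that refreshes itself. The question concerns the .NET AdlsClient, but here is the same pattern in Python, sketched with azure-identity (the account URL is a placeholder):

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Cache the credential/client, not a token: azure-identity fetches a fresh
    # managed-identity token automatically whenever the cached one nears expiry.
    credential = DefaultAzureCredential()
    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net",
        credential=credential,
    )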