
I have already installed the ADL extension in VS Code, and I am now writing a Python script that needs to read a CSV file stored in Azure Data Lake Storage (ADLS Gen1). For a local file, the following code works:

import pandas as pd
from pathlib import Path
df = pd.read_csv(Path('C:\\Users\\Documents\\breslow.csv'))
print(df)

How can I read data from ADLS? Even after successfully installing the ADL extension and connecting it to my Azure account, do I still need to create a scope and a secret?

Gilles Heinesch
Lav Mehta

2 Answers


Here is sample code for reading a CSV file from ADLS.

# -*- coding: utf-8 -*-
"""
Created on Wed Mar 20 11:37:19 2019

@author: Mohit Verma
"""

from azure.datalake.store import core, lib, multithread
token = lib.auth(tenant_id, username, password)  # supply your own tenant id and credentials
adl = core.AzureDLFileSystem(token, store_name=store_name)  # store_name: your ADLS account name

# typical operations
adl.ls('')
adl.ls('tmp/', detail=True)
adl.ls('tmp/', detail=True, invalidate_cache=True)
adl.cat('littlefile')
adl.head('gdelt20150827.csv')

# file-like object
with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    # could have passed f to any function requiring a file object:
    # pandas.read_csv(f)

with adl.open('anewfile', 'wb') as f:
    # data is written on flush/close, or when buffer is bigger than
    # blocksize
    f.write(b'important data')

adl.du('anewfile')

# recursively download the whole directory tree with 5 threads and
# 16MB chunks
multithread.ADLDownloader(adl, "", 'my_temp_dir', 5, 2**24)

Please try this code and see if it helps. For other samples related to Azure Data Lake, please refer to the GitHub repo below.

https://github.com/Azure/azure-data-lake-store-python/tree/master/azure

Also, if you want to understand the different types of authentication in ADLS, please check the code base below.

https://github.com/Azure-Samples/data-lake-analytics-python-auth-options/blob/master/sample.py

Mohit Verma
  • Hi @Mohit Verma, thanks for the detailed code. The typical operations and other commands will be helpful only once I have access. After the 3rd line of your code I am getting a proxy error: "Failed to establish a new connection: [WinError 10060]". Could you help me with that? – Lav Mehta Mar 27 '19 at 12:26
  • Judging by the error code, this seems to be a proxy-related issue. – Mohit Verma Mar 27 '19 at 12:55
  • This basically means that no response (either positive or negative) was received from the remote host when the TCP connection attempt took place. One reason this may happen is that a firewall is blocking the response from the server. Another reason is that the host name is incorrect. It could also mean there is a (temporary) problem with the server (or some router along the way). You can try a traceroute to determine whether or not this is true. – Mohit Verma Mar 27 '19 at 13:00
  • Thanks Mohit, your code for typical operations was very helpful. I used the method mentioned above, doing an App registration, and it worked well. Thanks for your input! – Lav Mehta Apr 15 '19 at 07:54
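The timeout described in the comments above can be checked before calling lib.auth. The following is a minimal sketch, not part of the original answer: `can_connect` is a hypothetical helper name, and it only tests whether a direct TCP connection to a given host and port can be opened.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (such as WinError 10060), refused connections,
        # and host names that cannot be resolved.
        return False
```

For example, `can_connect("login.microsoftonline.com")` probes the Azure AD login endpoint, assuming that is the host the authentication request goes to in your environment. A result of False on a corporate network usually points to a firewall or a required proxy rather than to a problem in the script itself.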

I wrote some sample code that reads a CSV file in Azure Data Lake into a pandas DataFrame. Here it is:

from azure.datalake.store import core, lib, multithread
import pandas as pd

tenant_id = '<your Azure AD tenant id>'
username = '<your username in AAD>'
password = '<your password>'
store_name = '<your ADL name>'
token = lib.auth(tenant_id, username, password)
# Alternatively, register an app to get a client_id and client_secret and use them to get the token.
# If you want to use this code in an application, I recommend authenticating as a client.
# client_id = '<client id of your app registered in Azure AD, like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx>'
# client_secret = '<your client secret>'
# token = lib.auth(tenant_id, client_id=client_id, client_secret=client_secret)

adl = core.AzureDLFileSystem(token, store_name=store_name)
# Open the remote file and let pandas parse it; the handle is closed afterwards
with adl.open('<your csv file path, such as data/test.csv in my ADL>') as f:
    df = pd.read_csv(f)

Note: If you use client_id and client_secret for authentication, you must grant the app registered in Azure AD the necessary access permissions, with at least the Reader role. For more information about access security, please see the official document Security in Azure Data Lake Storage Gen1. For how to register an app in Azure AD, you can refer to my answer to another SO thread, How to get an AzureRateCard with Java?.
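The client_id and client_secret route from the note above could be wrapped up roughly as follows. This is only a sketch under assumptions: the function names and the environment variable names (AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET) are hypothetical choices made here to keep the secret out of the script, and the lib.auth and AzureDLFileSystem calls are the same ones used in the sample code.

```python
import os

import pandas as pd


def connect_with_service_principal(store_name: str):
    """Authenticate with a client id and secret and return an ADLS Gen1 filesystem."""
    # Imported here so the CSV helper below can be used without the package installed.
    from azure.datalake.store import core, lib

    # Hypothetical environment variable names; use whatever your setup provides.
    token = lib.auth(
        tenant_id=os.environ["AZURE_TENANT_ID"],
        client_id=os.environ["AZURE_CLIENT_ID"],
        client_secret=os.environ["AZURE_CLIENT_SECRET"],
    )
    return core.AzureDLFileSystem(token, store_name=store_name)


def read_csv_from_adls(adl, path: str, **read_csv_kwargs) -> pd.DataFrame:
    """Open path on the given filesystem and parse it with pandas."""
    # The context manager closes the remote file handle once parsing is done.
    with adl.open(path, "rb") as f:
        return pd.read_csv(f, **read_csv_kwargs)
```

Typical use would be `adl = connect_with_service_principal('<your ADL name>')` followed by `df = read_csv_from_adls(adl, 'data/test.csv')`. Any extra keyword arguments, such as `sep` or `nrows`, are passed straight through to pd.read_csv.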

If you have any concerns, please feel free to let me know.

Peter Pan
  • Hi @Peter_Pan, first of all, thank you so much for your detailed reply. When I use your code without client_id and client_secret, i.e. only this part: tenant_id = '' username = '' password = '' store_name = '' token = lib.auth(tenant_id, username, password), I get this error: Failed to establish a new connection: [WinError 10060]. Currently I don't have access to AAD, which is why I chose this way :-( – Lav Mehta Mar 25 '19 at 11:03
  • @LavMehta Fine, it depends on your real scenario. Whichever approach you use, getting it to work is what matters first. If my answer helps, could you mark it as the answer? – Peter Pan Mar 26 '19 at 00:59
  • Actually, I am getting an error: Failed to establish a new connection: [WinError 10060], specifically when I execute this line: token = lib.auth(tenant_id, username, password). I tried marking your answer as useful, but it shows: thanks for the feedback! Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score. Sorry, I think I need more reputation :-)) – Lav Mehta Mar 26 '19 at 08:11
  • Using the client ID and client secret, solving the problem was much easier and error-free. Thanks again – Lav Mehta Apr 15 '19 at 07:52