
We're just getting started evaluating the datalake service at Azure. We created our lake, and via the portal we can see the two public URLs for the service. (One is an https:// scheme, the other an adl:// scheme)

The datalake documentation states that there are indeed two interfaces: the WebHDFS REST API, and ADL. So, I am assuming the https:// scheme gets me the WebHDFS interface. However, I can find no more information at Azure about using this interface.

I tried poking at the given https:// URL with a web browser and with curl. The service is responding. Replies are JSON, which is as expected, since a datalake is an instance of Hadoop. However, I cannot seem to get access to my files (which I uploaded into our lake via the portal).

If I do a GET to "/foo.txt", for example, the reply is an error, ResourceNotFound.

If I do a GET using the typical Hadoop HDFS syntax, "/webhdfs/v1/foo.txt", the reply is an error, AuthenticationFailed. Additional text indicates a missing access token. This seems more promising. However, I can't find anything about generating such an access token.

There is some documentation on using the ADL interface, and .NET and Visual Studio, but this is not what I want, initially.

Any help much appreciated!

GregGalloway
RickS

1 Answer


I am indebted to this forum post by Matthew Hicks which outlined how to do this with curl. I took it and wrapped it in PowerShell. I'm sure there are many ways to accomplish this, but here's one that works.

First set up an AAD application so that you can fill in the client_id and client_secret mentioned below. (That assumes you want to automate this rather than use an interactive login. If you want an interactive login, there's a link to that approach in the forum post above.)
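For readers who are not on PowerShell, here is a minimal Python sketch of what the client_id and client_secret are used for: a client-credentials token request against the same login URL and resource that the script below uses. The function names are my own, and it sends the fields URL-encoded rather than as the multipart form that curl's -F produces, so treat it as an illustration under those assumptions rather than the exact request from the forum post.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Both values are copied from the PowerShell script in this answer.
TOKEN_URL_TEMPLATE = "https://login.microsoftonline.com/{tenant}/oauth2/token"
RESOURCE = "https://management.core.windows.net/"


def build_token_request(tenant, client_id, client_secret):
    """Return (url, form_body) for a client-credentials token request.

    Hypothetical helper: it only assembles the request, it sends nothing.
    """
    url = TOKEN_URL_TEMPLATE.format(tenant=tenant)
    fields = {
        "grant_type": "client_credentials",
        "resource": RESOURCE,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    # Assumption: URL-encoded form fields instead of curl's multipart -F.
    return url, urlencode(fields)


def fetch_access_token(tenant, client_id, client_secret):
    """POST the form and return the access_token field of the JSON reply."""
    url, body = build_token_request(tenant, client_id, client_secret)
    request = urllib.request.Request(url, data=body.encode("ascii"), method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Calling fetch_access_token with your tenant, client id, and secret should hand back the same kind of bearer token that the script stores in $accessToken.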

Then fill in the settings in the first 5 lines and run the following PowerShell script:

$client_id = "<client id>";
$client_secret = "<secret>";
$tenant = "<tenant>";
$adlsAccount = "<account>";
cd D:\path\to\curl

#authenticate
$cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token  -F grant_type=client_credentials       -F resource=https://management.core.windows.net/       -F client_id=$client_id       -F client_secret=$client_secret };
$responseToken = Invoke-Command -scriptblock $cmd;
$accessToken = (ConvertFrom-Json $responseToken).access_token;

#list root folders
$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://$adlsAccount.azuredatalakestore.net/webhdfs/v1/?op=LISTSTATUS };
$foldersResponse = Invoke-Command -scriptblock $cmd;
#loop through directories
(ConvertFrom-Json $foldersResponse).FileStatuses.FileStatus | ForEach-Object { $_.pathSuffix }

#list files in one folder
$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://$adlsAccount.azuredatalakestore.net/webhdfs/v1/weather/?op=LISTSTATUS };
$weatherResponse = Invoke-Command -scriptblock $cmd;
(ConvertFrom-Json $weatherResponse).FileStatuses.FileStatus | ForEach-Object { $_.pathSuffix }

#download one file
$cmd = {.\curl.exe -L "https://$adlsAccount.azuredatalakestore.net/webhdfs/v1/weather/2007small.csv?op=OPEN" -H "Authorization: Bearer $accessToken" -o d:\temp\curl\2007small.csv };
Invoke-Command -scriptblock $cmd;


#upload one file
$cmd = {.\curl.exe -i -X PUT -L "https://$adlsAccount.azuredatalakestore.net/webhdfs/v1/weather/new2007small.csv?op=CREATE" -T "D:\temp\weather\smallcsv\new2007small.csv" -H "Authorization: Bearer $accessToken" };
Invoke-Command -scriptblock $cmd;
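If it helps to see the pattern without the curl plumbing, the calls above all have the same shape: a WebHDFS URL under /webhdfs/v1/ with an op parameter, plus an Authorization header carrying the token. Below is a rough Python sketch of that shape. The helper names are hypothetical, and it assumes the LISTSTATUS reply has the FileStatuses.FileStatus layout that the PowerShell above reads.

```python
import json
from urllib.parse import quote


def webhdfs_url(account, path, op):
    """Build a WebHDFS URL for a Data Lake Store account (hypothetical helper)."""
    clean_path = path.strip("/")  # "/" becomes "", which addresses the root
    return "https://{}.azuredatalakestore.net/webhdfs/v1/{}?op={}".format(
        account, quote(clean_path), op
    )


def auth_header(access_token):
    """Return the header that carries the AAD token on every request."""
    return {"Authorization": "Bearer " + access_token}


def path_suffixes(liststatus_json):
    """Extract the entry names from a LISTSTATUS reply.

    Assumes the FileStatuses.FileStatus[].pathSuffix structure that the
    PowerShell script in this answer also relies on.
    """
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    return [status["pathSuffix"] for status in statuses]
```

For example, webhdfs_url(account, "/", "LISTSTATUS") gives the root listing URL and webhdfs_url(account, "weather/2007small.csv", "OPEN") gives the download URL; pass auth_header(token) as the headers of whatever HTTP client you prefer.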
GregGalloway
  • Wonderful! Between the links provided, and your examples, it is starting to clarify. As explained, we begin by getting an authorization token via AAD. Once a token is obtained, the service is accessed as per stock HDFS syntax, with the header addition to send in the token. Makes sense. Over on that forum, I added the additional question: can your datalake be configured to not require any authorization at all? – RickS Apr 05 '16 at 20:28
  • @RickS that's a good question. I don't know how to set up anonymous access. If you figure it out, please do post back here. – GregGalloway Apr 05 '16 at 20:39