2

We have an Azure Hadoop HDI system where most of the files are stored in an Azure Storage Account Blob. Accessing the files from Hadoop requires the WASBS:// file system type.

I want to configure SQL 2016 Polybase to pushdown compute to the HDI cluster for certain queries of data stored in the Azure blobs.

It is possible to use Azure Blobs outside Hadoop in Polybase. I completely understand that the query hint "option (FORCE EXTERNALPUSHDOWN)" will not work on the Blob system.

Is it possible to configure an external data source to use HDI for compute on the blob?

A typical external data source configuration is:

CREATE EXTERNAL DATA SOURCE AzureStorage with (  
        TYPE = HADOOP,   
        LOCATION ='wasbs://clustername@storageaccount.blob.core.windows.net',  
        CREDENTIAL = AzureStorageCredential  
); 

I believe as long as that WASBS is in there, that pushdown compute will not work.

If I change the above to use HDFS, then I can certainly point to my HDI cluster, but then what would the LOCATION for the EXTERNAL TABLE be?

If this is in WASBS, then how would it be found in HDFS? LOCATION='/HdiSamples/HdiSamples/MahoutMovieData/'

Surely there is a way to get Polybase to pushdown compute to an HDI cluster where the files are in WASBS. If not, then Polybase does not support the most common and recommended way to setup HDI.

I know the above is a lot to consider and any help is appreciated. If you are really sure it is not possible, just answer NO. Please remember though that I realize Polybase operating on Azure Blobs directly cannot pushdown compute. I want Polybase to connect to HDI and let HDI compute on the blob.

EDIT

Consider the following setup in Azure with HDI.

Note that the default Hadoop file-system is WASBS. That means using a relative path such as /HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt will resolve to wasbs://YourClusterName@YourStorageAccount.blob.core.windows.net/HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt.

CREATE EXTERNAL DATA SOURCE HadoopStorage with (  
        TYPE = HADOOP,   
        LOCATION ='hdfs://172.16.1.1:8020', 
        RESOURCE_MANAGER_LOCATION = '172.16.1.1:8050', 
        CREDENTIAL = AzureStorageCredential  
);

CREATE EXTERNAL TABLE [user-ratings] (
    Field1 bigint,
    Field2 bigint,
    Field3 bigint,
    Field4 bigint

)
WITH (  LOCATION='/HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt',
        DATA_SOURCE = HadoopStorage,
        FILE_FORMAT = [TabFileFormat]
    );

There are many rows in the file in Hadoop. Yet, this query returns 0.

select count(*) from [user-ratings]

When I check the Remote Query Execution plan, it shows:

<external_uri>hdfs://172.16.1.1:8020/HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt</external_uri>

Notice the URI is an absolute path and is set to HDFS based on the External Data Source.

The query succeeds and returns zero because it is looking for a file/path that does not exist in the HDFS file-system. "Table not found" is not returned in case of no table. That is normal. What is bad is the real table is stored in WASBS and has many rows.

What this all means is Pushdown Compute is not supported when using Azure Blobs as the Hadoop default file system. The recommended setup is to use Azure Blobs so that the storage is separate from compute. It makes no sense PolyBase would not support this setup, but as of now it appears not to support it.

I will leave this question up in case I am wrong. I really want to be wrong.

Richard Blevins
  • 115
  • 1
  • 9

1 Answers1

0

If you want PolyBase to pushdown computation to any hadoop/HDI cluster, you need to specify RESOURCE_MANAGER_LOCATION while creating the external data source. The RESOURCE_MANAGER_LOCATION tells SQL server where to submit a MR job.

Karthik
  • 106
  • 3
  • Thanks for the response. Setting the resource manager is something I previously tried. Unfortunately, I see no setup where WASBS is supported for pushdown compute. This is a severe limiter of the use of PolyBase in an Azure HDI setup. Please see my edit for more info. – Richard Blevins Feb 27 '17 at 16:38