I need to run Spark on my local machine to access Azure wasb:// and adl:// URLs, but I can't get it to work. Here is a stripped-down example:
Maven pom.xml (a brand-new POM; only the dependencies have been added):
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure-datalake</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-data-lake-store-sdk</artifactId>
        <version>2.2.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-storage</artifactId>
        <version>7.0.0</version>
    </dependency>
</dependencies>
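As an aside, I'm not sure which Hadoop version actually wins on the classpath here, since spark-core 2.3.0 also pulls in its own hadoop-client transitively, and I'm mixing hadoop-common 2.8.0 with the 3.1.0 Azure modules. This sketch is how I checked (nothing here beyond Hadoop's standard VersionInfo utility):

import org.apache.hadoop.util.VersionInfo;

public class HadoopVersionCheck {
    public static void main(String[] args) {
        // Print the Hadoop version that is actually on the classpath,
        // since spark-core brings its own hadoop-client transitively.
        System.out.println("Hadoop on classpath: " + VersionInfo.getVersion());
    }
}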
Java code (it doesn't need to be Java; Scala would work too):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) {
        SparkConf config = new SparkConf();
        config.setMaster("local");
        config.setAppName("app");
        SparkSession spark = new SparkSession(new SparkContext(config));

        // Both of these reads fail with "No FileSystem for scheme":
        spark.read().parquet("wasb://container@host/path");
        spark.read().parquet("adl://host/path");
    }
}
No matter what I try, I get:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: wasb
The same happens for adl. Every doc I can find on this either just says to add the azure-storage dependency (which I have done) or says to use HDInsight.
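Among the things I tried was registering the filesystem implementations and credentials explicitly on the Hadoop configuration. This is only a sketch: the account name, storage key, and OAuth values are placeholders, and I'm not certain these property names are exactly right for these versions:

import org.apache.spark.sql.SparkSession;

public class AppExplicitFs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("app")
                // Map each URL scheme to its Hadoop filesystem class directly,
                // in case service discovery isn't finding them.
                .config("spark.hadoop.fs.wasb.impl",
                        "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
                .config("spark.hadoop.fs.adl.impl",
                        "org.apache.hadoop.fs.adl.AdlFileSystem")
                // Placeholder credentials -- not real values.
                .config("spark.hadoop.fs.azure.account.key.MYACCOUNT.blob.core.windows.net",
                        "MY_STORAGE_KEY")
                .config("spark.hadoop.fs.adl.oauth2.access.token.provider.type",
                        "ClientCredential")
                .config("spark.hadoop.fs.adl.oauth2.client.id", "MY_CLIENT_ID")
                .config("spark.hadoop.fs.adl.oauth2.credential", "MY_CLIENT_SECRET")
                .config("spark.hadoop.fs.adl.oauth2.refresh.url",
                        "https://login.microsoftonline.com/MY_TENANT_ID/oauth2/token")
                .getOrCreate();

        spark.read().parquet("wasb://container@MYACCOUNT.blob.core.windows.net/path");
    }
}

That still fails with the same exception for me.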
Any thoughts?