
I'd like to specify all of Spark's properties in a configuration file, and then load that configuration file at runtime.

~~~~~~~~~~Edit~~~~~~~~~~~

It turns out I was pretty confused about how to go about doing this. Ignore the rest of this question. For a simple solution (in Java Spark) showing how to load a .properties file into a Spark cluster, see my answer below.

The original question is below for reference purposes only.

~~~~~~~~~~~~~~~~~~~~~~~~

I want:

  • Different configuration files depending on the environment (local, AWS)
  • To specify application-specific parameters

As a simple example, imagine I'd like to filter lines in a log file based on a user-defined string. Below is a simple Java Spark program that reads data from a file and filters it by that string. The program takes one argument, the input source file.

Java Spark Code

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleSpark {
    public static void main(String[] args) {
        String inputFile = args[0]; // Should be some file on your system

        SparkConf conf = new SparkConf();// .setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(inputFile).cache();

        final String filterString = conf.get("filterstr");

        long numberLines = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains(filterString);
            }
        }).count();

        System.out.println("Line count: " + numberLines);
    }
}

Config File

The configuration file is based on https://spark.apache.org/docs/1.3.0/configuration.html and looks like this:

spark.app.name          test_app
spark.executor.memory   2g
spark.master            local
simplespark.filterstr   a

The Problem

I execute the application using the following arguments:

/path/to/inputtext.txt --conf /path/to/configfile.config

However, this doesn't work: the exception

Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration

is thrown. To me this means the configuration file is not being loaded.

My questions are:

  1. What is wrong with my setup?
  2. Is specifying application-specific parameters in the Spark configuration file good practice?
– Alexander

4 Answers


Try this when submitting your application with spark-submit:

--properties-file /path/to/configfile.config

then access the properties in your Scala program as:

sc.getConf.get("spark.app.name")
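
Since the question is in Java, here is a minimal equivalent sketch, assuming the application is submitted with spark-submit --properties-file /path/to/configfile.config (the class name is just a placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PropertiesFileExample {
    public static void main(String[] args) {
        // Properties from --properties-file are merged into SparkConf by spark-submit
        SparkConf conf = new SparkConf();
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read a Spark property defined in the file
        String appName = conf.get("spark.app.name");
        System.out.println("App name: " + appName);

        sc.stop();
    }
}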
– Poojaa Karaande
  • The problem with this solution is that spark ignores all config properties that are not spark related. Hence some property like `kafka.broker.url=someIp:9092` which would be important to the developer, would be ignored when spark sets up context. – Pritish Kamath Sep 27 '18 at 10:34

So after a bit of time, I realized I was pretty confused. The easiest way to get a configuration file into memory is to use a standard properties file, put it into HDFS, and load it from there. For the record, here is the code to do it (in Java Spark):

import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf();
JavaSparkContext ctx = new JavaSparkContext(sparkConf);

// Open the properties file stored in HDFS through Hadoop's FileSystem API
Path pt = new Path("hdfs:///user/hadoop/myproperties.properties");
FileSystem fs = FileSystem.get(ctx.hadoopConfiguration());
InputStream inputStream = fs.open(pt);

// Load it as a standard Java properties object
Properties properties = new Properties();
properties.load(inputStream);
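
For completeness, a short continuation of the snippet above showing how an application-specific value could then be read (simplespark.filterstr is just the property name from the question's config file):

// Application-specific values can then be read like any other property
String filterString = properties.getProperty("simplespark.filterstr");
System.out.println("Filter string: " + filterString);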
– Alexander
  • It doesn't work for me. Still getting FileNotFoundException. – nish Oct 08 '15 at 18:28
  • are you putting the file in hdfs? Are you using aws? – Alexander Oct 08 '15 at 18:37
  • Yes I put the file in hdfs. Yes, I am using aws EMR – nish Oct 08 '15 at 18:39
  • Getting the following two exceptions in yarn logs : `java.io.FileNotFoundException: /etc/spark/conf/log4j.properties (No such file or directory)` and `java.io.FileNotFoundException: hdfs:/user/hadoop/test.properties (No such file or directory)` – nish Oct 08 '15 at 18:41
  • tricky to debug from here, but are you using enough /'s ? You need three if I'm not mistaken, instead of `hdfs:/user/hadoop/test.properties` try specifying the path as `hdfs:///user/hadoop/test.properties` – Alexander Oct 08 '15 at 18:43
  • Indeed. I have added the following in my code `String configLocation = "hdfs:///user/hadoop/test.properties/";` – nish Oct 08 '15 at 18:46
  • The solution mentioned at `http://stackoverflow.com/questions/31115881/how-to-load-java-properties-file-and-use-in-spark` worked for me. – nish Oct 08 '15 at 19:19
  • Alexander, I believe @Poojaa Karaande's answer below is more ideal as it does not require loading the configuration into HDFS and is natively supported by Spark. – Garren S Apr 12 '17 at 01:03
  • This doesn't solve the problem you came to ask. You are hard coding your configuration file location into your app which defeats the whole purpose. You won't be able to deploy this to other environments. You need to be able to reference a configuration file living anywhere. – JMess May 07 '19 at 15:48

  1. --conf only sets a single Spark property; it's not for reading files.
    For example: --conf spark.shuffle.spill=false.
  2. Application parameters don't go into spark-defaults; they are passed as program args (and read in your main method). spark-defaults should contain SparkConf properties that apply to most or all jobs. If you want to use a config file instead of application parameters, take a look at Typesafe Config (see the sketch after this list). It also supports environment variables.
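
For illustration, a minimal sketch of that approach in Java (the question's language), assuming a Typesafe Config file such as application.conf or application.properties is on the classpath and defines the key simplespark.filterstr from the question; the class name is just a placeholder:

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class TypesafeConfigExample {
    public static void main(String[] args) {
        // ConfigFactory.load() reads application.conf / application.properties from the classpath
        Config config = ConfigFactory.load();

        // Application-specific parameter, kept out of spark-defaults
        String filterString = config.getString("simplespark.filterstr");
        System.out.println("Filter string: " + filterString);
    }
}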
– Marius Soutier

FWIW, using the Typesafe Config library, I just verified that this works in ScalaTest:

  import com.typesafe.config.ConfigFactory
  import org.apache.spark.SparkConf

  val props = ConfigFactory.load("spark.properties")
  val conf = new SparkConf().
    setMaster(props.getString("spark.master")).
    setAppName(props.getString("spark.app.name"))
  • How do you mention the properties file path? Where does it look for the file by default? What if the master node and node from which you run the application are different? – Ankit Khettry Jan 11 '17 at 14:28
  • I think this assumes the properties are available on the classpath – OneCricketeer Sep 26 '19 at 22:08