Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

Question

I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that, I use a JSON object that looks like this:

[
  {
    "Classification": "zeppelin-env",
    "Properties": {

    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET":"hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER":"user"
        },
        "Configurations": [

        ]
      }
    ]
  }
]

I paste this object into the Software configuration page of EMR. My question is, how/where can I configure the Spark interpreter directly without the need to manually configure it from Zeppelin each time I start a cluster?

score 11 · Accepted Answer · answered Jul 26 '17 at 14:20

This is a bit involved. You will need to do two things:

Edit the interpreter.json of Zeppelin
Restart the interpreter

So what you need to do is write a shell script and then add an extra step to the EMR cluster configuration that runs this shell script.

The Zeppelin configuration is stored as JSON, so you can use jq (a command-line JSON processor) to manipulate it. I don't know exactly what you want to change, but here is an example that adds the (mysteriously missing) DepInterpreter:

#!/bin/bash
set -e

# 1. Edit the Spark interpreter settings. Write to a temporary file first:
#    piping jq straight into tee on the same file can truncate it before it is read.
tmp=$(mktemp)
jq '.interpreterSettings."2ANGGHHMQ".interpreterGroup |= . + [{"class":"org.apache.zeppelin.spark.DepInterpreter", "name":"dep"}]' /etc/zeppelin/conf/interpreter.json > "$tmp"
sudo -u zeppelin tee /etc/zeppelin/conf/interpreter.json < "$tmp" > /dev/null
rm -f "$tmp"

# 2. Trigger a restart of the Spark interpreter
curl -X PUT http://localhost:8890/api/interpreter/setting/restart/2ANGGHHMQ

Put this shell script in an S3 bucket. Then start your EMR cluster with

--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/script.sh]
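The script-runner jar lives in a region-specific bucket, so the path above only fits eu-west-1. As a rough sketch, the argument could be assembled for any region like this. The function name `build_steps_arg` is made up for illustration, and the bucket and script names are the placeholders from the command above.

```shell
#!/bin/bash
# Sketch: assemble the --steps argument for a given region and script location.
# build_steps_arg is a hypothetical helper, not part of the AWS CLI.
build_steps_arg() {
  local region="$1" script_uri="$2"
  # Assumption: the jar path follows the s3://<region>.elasticmapreduce/... pattern
  local jar="s3://${region}.elasticmapreduce/libs/script-runner/script-runner.jar"
  printf 'Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=%s,Args=[%s]\n' \
    "$jar" "$script_uri"
}

# Prints the same argument as the command above
build_steps_arg "eu-west-1" "s3://mybucket/script.sh"
```

The result can then be passed as `--steps "$(build_steps_arg eu-west-1 s3://mybucket/script.sh)"` when creating the cluster.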

answered Jul 26 '17 at 14:20

rdeboo


Great, thanks @rdeboo. Can you please elaborate on what "2ANGGHHMQ" is? And can you please provide an example of setting "spark.yarn.executor.memoryOverhead" to 2048, which is my case, along with spark.executor.memory and spark.executor.cores? – Rami Aug 08 '17 at 09:49
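A minimal sketch of how such properties might be set with jq, run here against a throwaway sample file rather than the real interpreter.json. It assumes the flat layout where `properties` maps each name directly to a string value. Newer Zeppelin versions may store each property as an object instead, in which case the filter would need adjusting. The helper name and the values for spark.executor.memory and spark.executor.cores are placeholders. Only the 2048 comes from the question.

```shell
#!/bin/bash
# Hypothetical helper: set one property on one interpreter setting.
# Assumes the flat layout {"properties": {"<name>": "<value>"}}.
set_interpreter_property() {
  local file="$1" id="$2" name="$3" value="$4"
  local tmp
  tmp=$(mktemp)
  # Write to a temp file and only replace the original if jq succeeded
  jq --arg id "$id" --arg n "$name" --arg v "$value" \
     '.interpreterSettings[$id].properties[$n] = $v' "$file" > "$tmp" \
    && mv "$tmp" "$file"
}

# Demo on a sample file (not the real /etc/zeppelin/conf/interpreter.json)
demo=$(mktemp)
echo '{"interpreterSettings":{"2ANGGHHMQ":{"name":"spark","properties":{"master":"yarn-client"}}}}' > "$demo"

set_interpreter_property "$demo" 2ANGGHHMQ spark.yarn.executor.memoryOverhead 2048
set_interpreter_property "$demo" 2ANGGHHMQ spark.executor.memory 4g   # placeholder value
set_interpreter_property "$demo" 2ANGGHHMQ spark.executor.cores 2     # placeholder value

# Show the resulting properties
jq '.interpreterSettings."2ANGGHHMQ".properties' "$demo"
```

On the cluster the real file belongs to the zeppelin user, so the final write would have to go through `sudo -u zeppelin`, as in the answer's script.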
@Rami It's an internal key name that identifies the relevant section in interpreter.json. It seems stable (I've looked at many instances in EMR with different versions), but there are of course no guarantees that it will not change. In any case, I think AWS should just fix the default configuration so we can all stop using this workaround. – rdeboo Aug 14 '17 at 14:18
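Since the key is not guaranteed to stay the same, one option is to look it up by name instead of hard-coding it. A rough sketch, assuming each entry under `interpreterSettings` carries a `name` field and the Spark one is called "spark". The helper name and the second ID in the sample are made up.

```shell
#!/bin/bash
# Hypothetical helper: print the key of the interpreter setting named "spark".
find_spark_setting_id() {
  jq -r '.interpreterSettings | to_entries[] | select(.value.name == "spark") | .key' "$1"
}

# Sample file for illustration; "2BXXXXXXX" is an invented ID
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"interpreterSettings": {
  "2ANGGHHMQ": {"id": "2ANGGHHMQ", "name": "spark", "group": "spark"},
  "2BXXXXXXX": {"id": "2BXXXXXXX", "name": "md", "group": "md"}
}}
EOF

find_spark_setting_id "$sample"   # prints 2ANGGHHMQ
```

In the startup script, the result could be stored in a variable and used in place of the literal ID in both the jq filter and the restart URL.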
This is great work! But it needed a critical adjustment in my case. Restarting the interpreter using the REST API doesn't seem to pick up any changes in interpreter.json; Zeppelin itself needs to be restarted, at least on EMR. So instead of curl, it worked with: sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart – Radu Simionescu Jan 05 '18 at 19:21
Turns out "sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart" on EMR is sometimes problematic. The recommended way is to run "sudo stop zeppelin" and then "sudo start zeppelin" – Radu Simionescu Jan 07 '18 at 01:53

score -3 · Answer 2 · answered Nov 23 '18 at 09:36

I suggest using Terraform to create your cluster. There is an argument:

configurations_json = "${file("config.json")}"

that lets you inject a JSON file as the configuration for your EMR cluster

https://www.terraform.io/docs/providers/aws/r/emr_cluster.html

regards

answered Nov 23 '18 at 09:36

Julio


Misses the question: ```My question is, how/where I can configure the Spark interpreter directly without the need to manually configure it from Zeppelin each time I start a cluster?``` – 9bO3av5fw5 Nov 27 '18 at 18:12
And the answer is: write your configurations into a JSON file and add it to the Terraform option. I'm having the same problem, and I created a template to configure everything (Spark, Hive, Zeppelin, etc.) – Julio Nov 28 '18 at 15:45
And what do you write in config.json that alters the contents of `/etc/zeppelin/conf/interpreter.json`? – 9bO3av5fw5 Dec 03 '18 at 11:31
