8

I am creating clusters on EMR and configuring Zeppelin to read notebooks from S3. To do that, I am using a JSON object that looks like this:

[
  {
    "Classification": "zeppelin-env",
    "Properties": {

    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET":"hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER":"user"
        },
        "Configurations": [

        ]
      }
    ]
  }
]

I am pasting this object into the Software configuration page of EMR. My question is: how and where can I configure the Spark interpreter directly, without having to configure it manually in Zeppelin each time I start a cluster?

Rami

2 Answers

11

This is a bit involved. You will need to do two things:

  1. Edit the interpreter.json of Zeppelin
  2. Restart the interpreter

So you need to write a shell script and then add an extra step to the EMR cluster configuration that runs it.

The Zeppelin configuration is stored as JSON, so you can use jq (a command-line tool) to manipulate it. I don't know what you want to change exactly, but here is an example that adds the (mysteriously missing) DepInterpreter:

#!/bin/bash
set -e

# 1. Edit the Spark interpreter setting. Write to a temporary file first:
#    reading and writing interpreter.json in the same pipeline can truncate it.
CONF=/etc/zeppelin/conf/interpreter.json
TMP=$(mktemp)
jq '.interpreterSettings."2ANGGHHMQ".interpreterGroup |= . + [{"class":"org.apache.zeppelin.spark.DepInterpreter", "name":"dep"}]' "$CONF" > "$TMP"
sudo -u zeppelin tee "$CONF" < "$TMP" > /dev/null
rm -f "$TMP"

# 2. Trigger a restart of the Spark interpreter
curl -X PUT http://localhost:8890/api/interpreter/setting/restart/2ANGGHHMQ

Put this shell script in an S3 bucket. Then start your EMR cluster with

--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/script.sh]
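
For the Spark properties asked about in the comments, a similar jq edit might work. This is only a minimal sketch, not a tested EMR recipe. It assumes that the Spark setting in interpreter.json carries "name": "spark" and stores its properties as a flat map of strings (other Zeppelin versions may nest each property in an object, in which case the filter needs adapting). The function names set_spark_props and spark_setting_id are hypothetical, and the values 4g and 2 are placeholders; only the 2048 overhead comes from the question. Looking the setting up by name avoids hard-coding the 2ANGGHHMQ key.

```shell
#!/bin/bash
set -e

# Hypothetical helper: print the key of the interpreter setting named "spark".
spark_setting_id() {
  jq -r '.interpreterSettings | to_entries[]
         | select(.value.name == "spark") | .key' "$1"
}

# Hypothetical helper: merge Spark properties into the setting named "spark".
# Assumes properties are a flat {"key": "value"} map; 4g and 2 are placeholders.
set_spark_props() {
  local conf="$1" tmp
  tmp=$(mktemp)
  jq '(.interpreterSettings[] | select(.name == "spark") | .properties) |= . + {
        "spark.yarn.executor.memoryOverhead": "2048",
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2"
      }' "$conf" > "$tmp"
  # Write back only after jq has succeeded, so a failure cannot truncate the file.
  cat "$tmp" > "$conf"
  rm -f "$tmp"
}

# Example use on the cluster (run as a user allowed to write the file):
#   set_spark_props /etc/zeppelin/conf/interpreter.json
#   ID=$(spark_setting_id /etc/zeppelin/conf/interpreter.json)
#   curl -X PUT "http://localhost:8890/api/interpreter/setting/restart/$ID"
# Commenters report that on EMR a full restart may be needed instead:
#   sudo stop zeppelin && sudo start zeppelin
```

Existing properties are kept, because the filter merges the new keys into the current map instead of replacing it.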
rdeboo
  • Great, thanks @rdeboo. Can you please elaborate on what "2ANGGHHMQ" is? And can you please provide an example of setting "spark.yarn.executor.memoryOverhead" to 2048, which is my case, along with spark.executor.memory and spark.executor.cores? – Rami Aug 08 '17 at 09:49
  • 1
    @Rami it's an internal key that identifies the relevant section in interpreter.json. It seems stable (I've looked at many instances in EMR with different versions), but there are of course no guarantees that it will not change. In any case, I think AWS should just fix the default configuration so we can all stop using this workaround. – rdeboo Aug 14 '17 at 14:18
  • This is great work, but it needed a critical adjustment in my case. Restarting the interpreter through the REST API doesn't seem to pick up any changes in interpreter.json. Zeppelin itself needs to be restarted, at least on EMR. So instead of curl, it worked with: sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart – Radu Simionescu Jan 05 '18 at 19:21
  • 3
    It turns out that "sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart" is sometimes problematic on EMR. The recommended way is to run "sudo stop zeppelin" and then "sudo start zeppelin". – Radu Simionescu Jan 07 '18 at 01:53
-3

I suggest using Terraform to create your cluster. There is an argument:

configurations_json = "${file("config.json")}"

that lets you inject a JSON file as the configuration for your EMR cluster.

https://www.terraform.io/docs/providers/aws/r/emr_cluster.html

regards

Julio
  • Misses the question: ```My question is, how/where I can configure the Spark interpreter directly without the need to manually configure it from Zeppelin each time I start a cluster?``` – 9bO3av5fw5 Nov 27 '18 at 18:12
  • And the answer is: write your configurations into a JSON file and pass it to the Terraform option. I'm having the same problem, and I created a template to configure everything (Spark, Hive, Zeppelin, etc.) – Julio Nov 28 '18 at 15:45
  • And what do you write in config.json that alters the contents of `/etc/zeppelin/conf/interpreter.json`? – 9bO3av5fw5 Dec 03 '18 at 11:31