
I am running Zeppelin 0.7.0 on an emr-5.4.0 cluster, started with the default settings. The %spark.dep interpreter doesn't get configured by EMR.

I have had to edit the file /etc/zeppelin/conf/interpreter.json, which starts out as below:

"2ANGGHHMQ": {
  "id": "2ANGGHHMQ",
  "name": "spark",
  "group": "spark",
  "properties": {
    "spark.yarn.jar": "",
    "zeppelin.spark.printREPLOutput": "true",
    "master": "yarn-client",
    "zeppelin.spark.maxResult": "1000",
    "spark.app.name": "Zeppelin",
    "zeppelin.spark.useHiveContext": "true",
    "args": "",
    "spark.home": "/usr/lib/spark",
    "zeppelin.spark.concurrentSQL": "false",
    "zeppelin.spark.importImplicit": "true",
    "zeppelin.pyspark.python": "python",
    "zeppelin.dep.localrepo":"/usr/lib/zeppelin/local-repo"
  },
  "interpreterGroup": [
    {
      "class": "org.apache.zeppelin.spark.SparkInterpreter",
      "name": "spark"
    },
    {
      "class": "org.apache.zeppelin.spark.PySparkInterpreter",
      "name": "pyspark"
    },
    {
      "class": "org.apache.zeppelin.spark.SparkSqlInterpreter",
      "name": "sql"
    }
  ],
  "option": {
    "remote": true,
    "port": -1,
    "perNoteSession": false,
    "perNoteProcess": false,
    "isExistingProcess": false
  }
}

I have to manually add the following to the interpreterGroup array and restart Zeppelin:

{
  "class":"org.apache.zeppelin.spark.DepInterpreter",
  "name": "dep"
}
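
To apply the change, I restart Zeppelin on the master node (a minimal sketch; on emr-5.x release images Zeppelin runs as an Upstart service named zeppelin, so the following should work):

# restart the Zeppelin Upstart service on the EMR master node
sudo stop zeppelin
sudo start zeppelin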

Is there a way to make EMR use the default Zeppelin settings (and not remove this config)?
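
For context, this is the kind of paragraph the dep interpreter enables once registered (a minimal sketch; the Maven coordinate is only an illustrative example, and a %spark.dep paragraph must run before the Spark interpreter starts):

%spark.dep
z.reset()                                     // clear previously loaded artifacts
z.load("org.apache.commons:commons-csv:1.4")  // pull a Maven coordinate into the notebook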

UPDATE

Could someone also explain why the cluster I have just created this morning, by cloning the original cluster, has a completely different config?

"interpreterGroup": [
    {
      "name": "spark",
      "class": "org.apache.zeppelin.spark.SparkInterpreter",
      "defaultInterpreter": false,
      "editor": {
        "language": "scala",
        "editOnDblClick": false
      }
    },
    {
      "name": "pyspark",
      "class": "org.apache.zeppelin.spark.PySparkInterpreter",
      "defaultInterpreter": false,
      "editor": {
        "language": "python",
        "editOnDblClick": false
      }
    },
    {
      "name": "sql",
      "class": "org.apache.zeppelin.spark.SparkSqlInterpreter",
      "defaultInterpreter": false,
      "editor": {
        "language": "sql",
        "editOnDblClick": false
      }
    }
  ]
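
For reference, I'm checking what each cluster actually registers with a quick query on the master node (a hedged sketch, assuming jq is installed; settings live under the top-level interpreterSettings key of interpreter.json):

sudo jq '.interpreterSettings[] | select(.group == "spark") | .interpreterGroup[].name' /etc/zeppelin/conf/interpreter.json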
roblovelock
  • Thanks for sharing the manual method for this. I think it's a major oversight that this interpreter is not there; how else could we add external packages? I'm not sure how AWS thinks Zeppelin will be useful without that ability. – Davos Apr 12 '17 at 05:47
  • On this page https://community.hortonworks.com/questions/41537/adding-libraries-to-zeppelin.html there is a suggestion that you can use the local repo to store jar files. I'm not sure exactly how that would work, though: whether the path to the dependency needs to be added to the Spark (or other) interpreter, or whether simply having the jar in the local repo is enough to then import it in your code. – Davos Apr 12 '17 at 06:47
  • See the answer here: https://stackoverflow.com/questions/45328671/configure-zeppelins-spark-interpreter-on-emr-when-starting-a-cluster – Radu Simionescu Jan 05 '18 at 18:08

1 Answer


As per AWS, cloning a cluster only clones the basic configuration, not the changes you have made after creating it. Also, there is no configuration API in EMR that allows you to change Zeppelin's interpreter.json file, so for now the only way is to change the configuration manually.

Zeppelin does seem to have a set of REST APIs that allow you to change interpreter settings, especially this API endpoint, which allows you to create interpreter settings. However, it does not seem to work with the following payload:

POST : http://[zeppelin-server]:[zeppelin-port]/api/interpreter/setting

Payload:

{
  "name": "dep",
  "group": "spark",
  "properties": {},
  "interpreterGroup": [
    {
       "class":"org.apache.zeppelin.spark.DepInterpreter",
       "name": "dep",
       "defaultInterpreter": true
    }
  ],
  "dependencies": []
}

So, the only option at the moment is to change interpreter.json manually. Should the above endpoint work, you can add the call to a cluster creation step as explained here.
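
In that case, the step script could look something like this (a hedged sketch: /tmp/dep-payload.json is a hypothetical path for the payload above, and 8890 is Zeppelin's default port on EMR):

#!/bin/bash
# Hypothetical EMR step script for the master node: wait until Zeppelin's REST
# API responds, then register the dep interpreter from the saved payload.
until curl -s -o /dev/null http://localhost:8890/api/interpreter/setting; do
  sleep 5
done
curl -X POST http://localhost:8890/api/interpreter/setting \
  -H "Content-Type: application/json" \
  -d @/tmp/dep-payload.json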

Darshan Mehta
  • Running `curl -vX POST http://localhost:8890/api/interpreter/setting -d @payload.json --header "Content-Type: application/json"` with the above payload works on an EMR cluster. It'll respond with a `CREATING` status, and the interpreter will be available after a minute. – kadrach Feb 20 '18 at 10:17