Questions tagged [livy]

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface

From http://livy.incubator.apache.org.

What is Apache Livy?

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, and Spark context management, all via a simple REST interface or an RPC client library. Apache Livy also simplifies the interaction between Spark and application servers, enabling the use of Spark for interactive web/mobile applications. Additional features include:

  • Have long running Spark Contexts that can be used for multiple Spark jobs, by multiple clients
  • Share cached RDDs or Dataframes across multiple jobs and clients
  • Multiple Spark Contexts can be managed simultaneously, and the Spark Contexts run on the cluster (YARN/Mesos) instead of the Livy Server, for good fault tolerance and concurrency
  • Jobs can be submitted as precompiled jars, snippets of code or via java/scala client API
  • Ensure security via secure authenticated communication
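
The interaction model described above can be sketched against Livy's REST API with nothing but the Python standard library; the endpoint, session id, and code snippet below are placeholder assumptions, not part of the tag wiki:

```python
import json
from urllib import request

# Assumed Livy endpoint; adjust host/port for your deployment.
LIVY_URL = "http://localhost:8998"

def build_post(path, payload):
    """Build a POST request for a Livy endpoint. Livy expects JSON
    bodies, and deployments with CSRF protection enabled also require
    an X-Requested-By header."""
    return request.Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Requested-By": "livy-client",
        },
    )

# Create an interactive PySpark session, then run a snippet in it
# (session id 0 is assumed; a real client reads it from the response):
create_session = build_post("/sessions", {"kind": "pyspark"})
run_snippet = build_post("/sessions/0/statements",
                         {"code": "spark.range(100).count()"})
# request.urlopen(create_session)  # uncomment against a live server
```

The same pattern covers batch submission (`POST /batches`), which is what most of the questions below revolve around.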

288 questions
5 votes, 0 answers

What is the difference between livy.rsc.jars and livy.repl.jars?

I'm working on Jupyter Notebooks using the sparkmagic kernel (spark-scala), which relies on Apache Livy to run Spark jobs. I'm currently trying to understand the options to create sessions with user-provided dependencies, i.e., jars. I know in Jupyter I…
Ohtar10
5 votes, 0 answers

Storing Python packages in HDFS for Livy PySpark

I am submitting PySpark jobs to the cluster through Livy. Currently the dependent Python packages like NumPy, Pandas, Keras etc. are installed on all the datanodes. I was wondering if all of these packages can be stored centrally in HDFS and how can…
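
One commonly used approach (a sketch, not the only option): package the Python environment as an archive on HDFS and reference it in the Livy batch payload, pointing the interpreters at the unpacked alias. The paths and the `environment` alias below are assumptions:

```python
# Sketch of a Livy POST /batches payload that ships a Python
# environment from HDFS instead of relying on datanode installs.
# Archive path, job path, and alias are placeholders.
batch_payload = {
    "file": "hdfs:///jobs/my_job.py",
    # "#environment" names the directory the archive unpacks into
    # inside each YARN container.
    "archives": ["hdfs:///envs/venv.tar.gz#environment"],
    "conf": {
        # Point driver (in cluster mode) and executors at the
        # interpreter unpacked from the archive.
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./environment/bin/python",
        "spark.executorEnv.PYSPARK_PYTHON": "./environment/bin/python",
    },
}
```

The archive has to be built against the same OS/architecture as the cluster nodes for compiled packages like NumPy to work.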
5 votes, 2 answers

How do I run Spark jobs concurrently in the same AWS EMR cluster?

Is it possible to submit and run Spark jobs concurrently in the same AWS EMR cluster? If yes, could you please elaborate?
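
Livy's `/batches` endpoint runs each submission as its own YARN application, so concurrency on one EMR cluster is mostly a question of YARN queue capacity. A minimal sketch, assuming a Livy endpoint on the EMR master node (jar path and class names are placeholders):

```python
import json
from urllib import request

LIVY_URL = "http://livy-host:8998"  # assumed EMR master endpoint

def submit_batch(jar, klass):
    """Build a POST /batches request; each batch becomes its own YARN
    application, so several can run side by side if the cluster has
    capacity."""
    payload = {"file": jar, "className": klass}
    return request.Request(
        LIVY_URL + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "X-Requested-By": "livy-client"},
    )

# Two independent batches; POSTing both (via urlopen) would run them
# concurrently, subject to the YARN scheduler's queue limits.
reqs = [submit_batch("s3://bucket/app.jar", "com.example.JobA"),
        submit_batch("s3://bucket/app.jar", "com.example.JobB")]
```

If jobs end up queued instead of running together, the bottleneck is usually YARN scheduler configuration, not Livy.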
5 votes, 0 answers

Bad Request: "requirement failed: Session isn't active." in Apache livy

code: public class PiApp { public static void main(String[] args) throws Exception { LivyClient client = new LivyClientBuilder().setURI(new URI("http://localhost:8998/")).build(); try { System.out.println("Uploading livy-example…
Pare
5 votes, 1 answer

Livy REST API: GET requests work but POST requests fail with '401 Authentication required'

I’ve written a Java client for parts of Livy’s REST API at https://github.com/apache/incubator-livy/blob/master/docs/rest-api.md. The client uses Spring’s RestTemplate.getForObject() and postForObject() to make GET and POST requests respectively.…
snark
5 votes, 1 answer

Why is Apache Livy session showing Application id NULL?

I've implemented a fully functional Spark 2.1.1 Standalone cluster, where I POST job batches via the curl command using Apache Livy 0.4. When consulting the Spark WEB UI I see my job along with its application id (something like this:…
5 votes, 3 answers

How to kill spark/yarn job via livy

I am trying to submit a Spark job via Livy using the REST API. But if I run the same script multiple times it runs multiple instances of the job with different job IDs. I am looking for a way to kill a Spark/YARN job running with the same name before starting a new one.…
roy
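
Per Livy's REST API, a batch (and its underlying Spark/YARN application) can be killed with `DELETE /batches/{id}`; to replace a job by name, a client would first `GET /batches`, match on the name, then delete. A sketch of the delete request, assuming a local endpoint and a placeholder batch id:

```python
from urllib import request

LIVY_URL = "http://localhost:8998"  # assumed endpoint

def build_kill(batch_id):
    """DELETE /batches/{id} asks Livy to kill the batch and the YARN
    application behind it. X-Requested-By is needed when CSRF
    protection is on."""
    return request.Request(
        f"{LIVY_URL}/batches/{batch_id}",
        method="DELETE",
        headers={"X-Requested-By": "livy-client"},
    )

kill_req = build_kill(42)  # 42 is a placeholder batch id
# request.urlopen(kill_req)  # uncomment against a live server
```

The equivalent without Livy is `yarn application -kill <appId>` on the cluster itself.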
5 votes, 1 answer

sparklyr livy connection with Kerberos

I'm able to connect to a non-Kerberized Spark cluster through the Livy service without problems from a remote RStudio desktop (Windows). However, if Kerberos security is enabled, the connection fails: library(sparklyr) sc <-…
runr
4 votes, 0 answers

PySpark virtual environment archive on S3

I'm trying to deploy PySpark applications to an EMR cluster that have various, differing, third-party dependencies, and I am following this blog post, which describes a few approaches to packaging a virtual environment and distributing that across…
user4601931
4 votes, 1 answer

YARN doesn't recognize increased 'yarn.scheduler.maximum-allocation-mb' and 'yarn.nodemanager.resource.memory-mb' values

I'm working with a dockerized PySpark cluster which utilizes YARN. To improve the efficiency of the data processing pipelines I want to increase the amount of memory allocated to the PySpark executors and the driver. This is done by adding the…
MilkSilk
4 votes, 2 answers

How to pull Spark job client logs submitted using the Apache Livy batches POST method with Airflow

I am working on submitting Spark jobs using the Apache Livy batches POST method. This HTTP request is sent using Airflow. After submitting a job, I am tracking its status using the batch id. I want to show the driver (client) logs in the Airflow logs to avoid going…
Ramdev Sharma
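
Livy exposes the batch driver logs over REST, so an Airflow task can poll them by batch id and echo them into its own log. A minimal sketch of the log request; the endpoint, batch id, and paging values are placeholder assumptions:

```python
from urllib import request

LIVY_URL = "http://localhost:8998"  # assumed Livy endpoint

def build_log_request(batch_id, start=0, size=100):
    """GET /batches/{id}/log returns a window of driver log lines;
    the 'from' and 'size' query parameters page through them. An
    Airflow task can loop on this until the batch reaches a terminal
    state, printing each page so it lands in the task log."""
    return request.Request(
        f"{LIVY_URL}/batches/{batch_id}/log?from={start}&size={size}"
    )

log_req = build_log_request(7, start=0, size=50)  # 7 is a placeholder id
# lines = json.load(request.urlopen(log_req))["log"]  # against a live server
```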
4 votes, 1 answer

Spark job submission using Airflow by submitting batch POST method on Livy and tracking job

I want to use Airflow for orchestration of jobs that includes running some Pig scripts, shell scripts and Spark jobs. Mainly for the Spark jobs, I want to use Apache Livy but am not sure whether it is a good idea to use it or to run spark-submit. What is the best way…
Ramdev Sharma
4 votes, 2 answers

Livy No YARN application is found with tag livy-batch-10-hg3po7kp in 120 seconds

I used Livy to execute a script stored in S3 via a POST request launched from EMR. The script runs, but it times out very quickly. I have tried editing the livy.conf configuration, but none of the changes seem to stick. This is the error that is…
Aaron Liang
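
The 120 seconds in the error matches Livy's default timeout for locating the submitted application in YARN. One commonly suggested mitigation (the key name is assumed from livy.conf.template; verify against your Livy version) is raising it in livy.conf and restarting the Livy server:

```
# livy.conf — assumed key name; requires a Livy server restart to take effect
livy.server.yarn.app-lookup-timeout = 300s
```

On EMR, edits made directly on the master node can also be overwritten by the cluster configuration mechanism, which would explain changes "not sticking"; applying them through an EMR configuration classification is the safer route.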
4 votes, 2 answers

Java application with Apache Livy

I decided to build a web service (app) for Apache Spark with Apache Livy. The Livy server is up and running on localhost port 8998 according to the Livy configuration defaults. My test program is a sample application from the Apache Livy documentation: …
4 votes, 1 answer

Setting spark.local.dir in Pyspark/Jupyter

I'm using PySpark from a Jupyter notebook and attempting to write a large Parquet dataset to S3. I get a 'no space left on device' error. I searched around and learned that it's because /tmp is filling up. I now want to edit spark.local.dir to point…
c3p0
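
A key constraint here: spark.local.dir is read when the SparkContext starts, so setting it from an already-running notebook has no effect. It has to go into spark-defaults.conf or into the session-creation configuration, e.g. a Livy/sparkmagic session payload (the path below is an assumed placeholder):

```python
# Sketch of a Livy session payload that sets spark.local.dir before
# the context starts; "/mnt/large-disk/spark-tmp" is a placeholder for
# any volume with enough free space.
session_payload = {
    "kind": "pyspark",
    "conf": {"spark.local.dir": "/mnt/large-disk/spark-tmp"},
}
```

With sparkmagic, the same `conf` map can be supplied via the `%%configure` magic before the session is created.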