11

I want to read CSV files in Zeppelin and would like to use Databricks' spark-csv package: https://github.com/databricks/spark-csv

In the spark-shell, I can use spark-csv with

spark-shell --packages com.databricks:spark-csv_2.11:1.2.0

But how do I tell Zeppelin to use that package?

Thanks in advance!

fabsta
  • https://zeppelin.incubator.apache.org/docs/interpreter/spark.html#dependencyloading – zero323 Oct 06 '15 at 10:03
  • ok, added %dep --packages com.databricks:spark-csv_2.11:1.2.0 to a Zeppelin notebook, but it gave: "Must be used before SparkInterpreter (%spark) initialized". Haven't used %spark in the notebook, however – fabsta Oct 06 '15 at 14:07
  • How about %pyspark or %sql? – zero323 Oct 06 '15 at 14:11
  • not sure I understand. Can you give an example, @zero323? – fabsta Oct 07 '15 at 06:53
  • Did you use either `%pyspark` or `%sql` in your notebook? – zero323 Oct 07 '15 at 06:54
  • You can also try: `ZEPPELIN_JAVA_OPTS="-Dspark.jars=/path/to/spark-csv"` – zero323 Oct 07 '15 at 06:58
  • @fabsta: were you able to solve the "Must be used before SparkInterpreter (%spark) initialized" error? If not, the answer is to restart the interpreter (Interpreter tab, then restart the Spark interpreter), along with Samuel's answer. I did not have to use z.reset() though. – RAbraham Dec 24 '15 at 15:22

6 Answers

15

You need to add the Spark Packages repository to Zeppelin before you can use %dep to load Spark packages.

%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")

Alternatively, if this is something you want available in all your notebooks, you can add the --packages option to the spark-submit command setting in the interpreter config in Zeppelin, and then restart the interpreter. This should start a context with the package already loaded, as with the spark-shell method.
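
One way to do that (a sketch, assuming you pass spark-submit flags through SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh; the coordinates below are the Scala 2.10 build of the same package):

export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"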

Simon Elliston Ball
  • %dep is now deprecated (0.6.1)... see Paul's answer (use the GUI) – Pete Aug 31 '16 at 14:38
  • True. This should now be done in the interpreter configuration as Paul below states. – Simon Elliston Ball Sep 10 '16 at 09:20
  • %dep should no longer be considered deprecated at this time. See Paul-Armand's answer for the reasons why. – Paul-Armand Verhaegen Nov 07 '16 at 07:14
  • This approach should not be used because jars added using this approach will not be distributed to the spark executors. – Harvinder Singh Apr 19 '18 at 10:29
  • @HarvinderSingh are you sure about that? This approach *does* distribute the jars as part of the way it submits to spark, which is why these notes have to be run before the spark interpreters (at least in the older versions of zeppelin I tested on). That said, the other approaches suggested here provide a cleaner alternative on newer versions. – Simon Elliston Ball Apr 20 '18 at 12:51
8
  1. Go to the Interpreter tab, click Repository Information, add a repo and set the URL to http://dl.bintray.com/spark-packages/maven
  2. Scroll down to the spark interpreter paragraph and click edit, scroll down a bit to the artifact field and add "com.databricks:spark-csv_2.10:1.2.0" or a newer version. Then restart the interpreter when asked.
  3. In the notebook, use something like:

    import org.apache.spark.sql.SQLContext
    
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") // Use first line of all files as header
        .option("inferSchema", "true") // Automatically infer data types
        .load("my_data.txt")
    

Update:

On the Zeppelin user mailing list, Moon Soo Lee (creator of Apache Zeppelin) stated (Nov. 2016) that users prefer to keep %dep, as it allows for:

  • self-documenting library requirements in the notebook;
  • per-Note (and possibly per-User) library loading.

The tendency is now to keep %dep, so it should not be considered deprecated at this time.

  • I'm not sure what you mean by "create a repo". In my Zeppelin interpreter tab I can create a full new interpreter environment. Plus, I have that Spark packages URL against the field `zeppelin.dep.additionalRemoteRepository`, so how should I exactly make it load the CSV package? – mar tin Oct 12 '16 at 14:20
  • @martin Create a repo (repository) by clicking on the gear icon to the left of the "create" button (that button creates a full new interpreter environment, which is not what you want). This should expand the available repository list and reveal a "+" button. Click on the "+" button and add http://dl.bintray.com/spark-packages/maven as the URL. You can then just follow Steps 2 and 3. As for your other question, it is normal to have that URL in zeppelin.dep.additionalRemoteRepository; the dependency can now be resolved since the external repo is added in Step 1. – Paul-Armand Verhaegen Oct 14 '16 at 06:36
4

BEGIN-EDIT

%dep is deprecated in Zeppelin 0.6.0. Please refer to Paul-Armand Verhaegen's answer.

Please read further in this answer if you are using a Zeppelin version older than 0.6.0.

END-EDIT

You can load the spark-csv package using the %dep interpreter, like this:

%dep
z.reset()

// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.2.0")

See the Dependency Loading section in https://zeppelin.incubator.apache.org/docs/interpreter/spark.html

If you've already initialized the Spark context, the quick solution is to restart Zeppelin, execute the Zeppelin paragraph with the above code first, and then execute your Spark code to read the CSV file.
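
For example, after the %dep paragraph above has run, read the file in a separate paragraph (a minimal sketch; my_data.csv is just a placeholder path):

%spark
// spark-csv registers the "com.databricks.spark.csv" data source
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // treat the first line as a header
    .load("my_data.csv")      // placeholder path
df.printSchema()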

sag
1

You can add jar files under Spark Interpreter dependencies:

  1. Click 'Interpreter' menu in navigation bar.
  2. Click 'edit' button for Spark interpreter.
  3. Fill in the artifact and exclude fields.
  4. Press 'Save'.
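
Once the interpreter restarts with the artifact (e.g. com.databricks:spark-csv_2.10:1.2.0), a notebook paragraph can use the data source directly; a minimal sketch, with a placeholder file path:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("/path/to/file.csv") // placeholder path
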
Gilad
0

If you define SPARK_HOME in conf/zeppelin-env.sh:

export SPARK_HOME=<PATH_TO_SPARK_DIST>

Zeppelin will then look in $SPARK_HOME/conf/spark-defaults.conf and you can define jars there:

spark.jars.packages                com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41

Then look at http://zeppelin_url:4040/environment/ for the following:

spark.jars file:/root/.ivy2/jars/com.databricks_spark-csv_2.10-1.4.0.jar,file:/root/.ivy2/jars/org.postgresql_postgresql-9.3-1102-jdbc41.jar

spark.jars.packages com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41

For more reference: https://zeppelin.incubator.apache.org/docs/0.5.6-incubating/interpreter/spark.html
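
You can also confirm from a notebook paragraph that the setting was picked up (a small sketch; the key only exists if spark.jars.packages was actually set):

// prints the configured package coordinates, if any
sc.getConf.getOption("spark.jars.packages").foreach(println)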

lapolonio
0

Another solution:

In conf/zeppelin-env.sh (located in /etc/zeppelin for me), add the line:

export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"

Then start the service.

Zack