
I have used snappy-sql, where I created some tables and ran some inserts and queries... everything was ok

Then, as I need to import a lot of data from CSV files, I created a Scala script that reads each of the files, extracts the data, and tries to insert it into the database

For this I am using the Spark that comes with SnappyData; I connect using:

./bin/spark-shell --conf spark.snappydata.store.sys-disk-dir=snappydatadir --conf spark.snappydata.store.log-file=snappydatadir/quickstart.log

The directory exists and everything "runs" "ok"... (not quite true)

Here is the problem... when I try to run queries over the tables I created in snappy-sql, the spark-shell tells me that the tables do not exist... and the same happens when the script reaches the insert command

So my question, as I am a newbie...

How do I connect from that spark-shell (snappydir/bin/spark-shell...) and use the tables that already exist in SnappyData?

I bet I am not adding some specific configuration...

Thanks for the help... as I said, I am less than basic in SnappyData and Spark, so I am feeling a little lost trying to configure and set up my environment...

1 Answer


To import CSV data into SnappyData you can create an external table and use the 'insert into' command to import the data. You don't have to use spark-shell; just use the snappy-sql shell. For example:

-- this assumes the locator is running on localhost
snappy> connect client 'localhost:1527';
-- create an external table
snappy> CREATE EXTERNAL TABLE table1_staging (col1 int, col2 int, col3 string) USING csv OPTIONS (path 'path_to_csv_file');
-- import data into table1
snappy> insert into table1 select * from table1_staging;

The path_to_csv_file should be accessible from all servers. Refer to the how-to section in the docs for other ways to import data.
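If you prefer to drive the same import from Scala, here is a minimal sketch. It assumes a SnappySession named snappy that is connected to the cluster (obtaining one from spark-shell is shown further down):

// minimal sketch of the same staging-table import driven from Scala,
// assuming "snappy" is a SnappySession connected to the cluster
snappy.sql("CREATE EXTERNAL TABLE table1_staging (col1 int, col2 int, col3 string) USING csv OPTIONS (path 'path_to_csv_file')")
// import the staged rows into table1
snappy.sql("insert into table1 select * from table1_staging")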

To use spark-shell, make use of the snappydata.connection property (passed as spark.snappydata.connection via --conf) to connect to the SnappyData cluster. The value of this property is of the form 'locatorHost:clientConnectionPort' (the default client connection port is 1527).

For example:

bin/spark-shell --master local[*] --conf spark.snappydata.connection=localhost:1527 --conf spark.ui.port=4041
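Once the shell is up, create a SnappySession on top of the shell's SparkContext; the tables created earlier through snappy-sql should then be visible. A minimal sketch (table1 is just an illustrative name):

import org.apache.spark.sql.SnappySession

// create a SnappySession from the shell's SparkContext (sc); it connects to
// the cluster given by spark.snappydata.connection
val snappy = new SnappySession(sc)

// tables created earlier through snappy-sql should now be visible
snappy.sql("show tables").show()
snappy.sql("select count(*) from table1").show()   // table1 is illustrative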

Refer to the documentation for more details about how to connect using spark-shell.

  • Hello, thanks for your answer... but what about having to unzip thousands of files and process the thousands of CSVs stored in those zip files? I have already created a script that does that from Spark/Scala, using "val tablename_DF = snSession.read.csv(tablenamecsv); tablename_DF.write.insertInto("tablename")" (see the sketch after these comments). I will try your code with snSession.sql and test the performance... but it still does not answer my question: when I connect from snappy-sql I do not see the changes made from snappydir/bin/spark-shell... – Mauricio Chica Patiño Jul 05 '17 at 21:31
  • The example you have pasted for the use of spark-shell does not make use of the "spark.snappydata.connection" parameter, so it is not actually connecting to SnappyData. That might be the reason why you are not seeing the tables. – Shirish Deshmukh Jul 06 '17 at 05:36
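For the bulk-import loop mentioned in the first comment, a rough sketch of the per-file DataFrame approach could look like the following. It assumes the zip archives have already been extracted into a local directory and that a target table with each file's base name already exists; the directory path is a hypothetical placeholder, and snappy is a connected SnappySession as in the answer above:

import java.io.File

// illustrative sketch: loop over already-extracted CSV files and append each
// one into an existing SnappyData table of the same base name; assumes
// "snappy" is a SnappySession connected to the cluster, as in the answer above
val csvDir = new File("/path/to/extracted/csvs")   // hypothetical directory
for (f <- csvDir.listFiles if f.getName.endsWith(".csv")) {
  val df = snappy.read.csv(f.getAbsolutePath)
  df.write.insertInto(f.getName.stripSuffix(".csv"))   // table name = file base name
}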