Questions tagged [snappydata]

SnappyData is an open source integration of the GemFireXD in-memory database and the Apache Spark cluster computing system for OLTP, OLAP, and Approximate Query Processing workloads.

From https://github.com/SnappyDataInc/snappydata

SnappyData is a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) in a single integrated, highly concurrent, highly available cluster. This platform is realized through a seamless integration of Apache Spark (as a big data computational engine) with GemFireXD (as an in-memory transactional store with scale-out SQL semantics).

Within SnappyData, GemFireXD runs in the same JVMs as the Spark executors. This keeps data movement between the store and the executors cheap and keeps the overall architecture simple. Spark jobs run inside the SnappyData cluster, though the database can also be accessed with plain SQL via ODBC/JDBC, Thrift, or REST without going through Spark.
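
For orientation, here is a minimal sketch of those two access paths, assuming an application launched inside a running cluster and the default client port 1527; the table name is hypothetical and the JDBC URL/driver setup should be checked against the documentation for your release.

    import java.sql.DriverManager
    import org.apache.spark.sql.{SnappySession, SparkSession}

    object AccessSketch {
      def main(args: Array[String]): Unit = {
        // Embedded mode: a SnappySession over the SparkContext talks to the
        // column/row store colocated with the executors.
        val spark = SparkSession.builder().appName("access-sketch").getOrCreate()
        val snappy = new SnappySession(spark.sparkContext)
        snappy.sql("SELECT COUNT(*) FROM some_table").show()  // hypothetical table

        // Plain SQL without Spark: JDBC against the cluster's client port.
        val conn = DriverManager.getConnection("jdbc:snappydata://localhost:1527/")
        val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM some_table")
        while (rs.next()) println(rs.getLong(1))
        conn.close()
      }
    }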

SnappyData packages Approximate Query Processing (AQP) technology. The basic idea behind AQP is that one can use statistical sampling techniques and probabilistic data structures to answer aggregate-class queries without needing to store or operate over the entire data set. This approach trades query accuracy for quicker response times, allowing queries to be run on large data sets while returning meaningful and accurate error information. A real-world example is the political polling run by Gallup and others, where a small sample is used to estimate support for a candidate within a small margin of error.

It's important to note that not all SQL queries can be answered through AQP, but by moving a subset of queries hitting the database to the AQP module, the system as a whole becomes more responsive and usable.
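
As a rough illustration, the sketch below assumes a SnappySession bound to an existing SparkContext, the sample-table DDL and WITH ERROR clause described in the AQP documentation, and hypothetical table and column names; the exact option names should be verified for the release in use.

    // sc is an existing SparkContext
    val snappy = new org.apache.spark.sql.SnappySession(sc)

    // Maintain a ~3% stratified sample of the base table, stratified on 'candidate'.
    snappy.sql(
      """CREATE SAMPLE TABLE polls_sample ON polls
        |OPTIONS (qcs 'candidate', fraction '0.03')
        |AS (SELECT * FROM polls)""".stripMargin)

    // An aggregate query answered from the sample, with a requested error bound.
    snappy.sql(
      """SELECT candidate, COUNT(*) AS support FROM polls
        |GROUP BY candidate
        |WITH ERROR 0.05 CONFIDENCE 0.95""".stripMargin).show()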

Important links:

- The SnappyData Github Repo
- SnappyData public Slack/Gitter/IRC Channels
- SnappyData technical paper
- SnappyData Documentation
- SnappyData ScalaDoc
- SnappyData Screencasts

132 questions
1 vote · 1 answer

How to set up SnappyData cluster with any host other than localhost?

When I set up the SnappyData cluster with all locators, servers and leads running on the same machine and the host names specified as "localhost", I can see the service come up. With the same setup, when I replace localhost with the…
Subhendu
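
A minimal sketch of the per-host configuration this question is about, assuming the usual conf/locators, conf/servers and conf/leads files and the default peer-discovery port 10334; the host names are made up and the property names should be checked against the configuration documentation.

    # conf/locators
    node1.example.com -peer-discovery-port=10334 -client-bind-address=node1.example.com

    # conf/servers
    node2.example.com -locators=node1.example.com:10334 -client-bind-address=node2.example.com

    # conf/leads
    node3.example.com -locators=node1.example.com:10334
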
1 vote · 1 answer

Spark Structured Streaming supported by SnappyData

I've just learned about SnappyData (and watched some videos about it), and it looks interesting, mainly where it says that performance might be many times faster than a regular Spark job. Could the following code (snippet) leverage the SnappyData…
Kleyson Rios
1 vote · 0 answers

Dependencies and Includes for SnappyData Jobs

What do I add to my SBT and include in my Scala class header to build a SnappyJob to use via snappy-job.sh submit? I'm attempting to do some basic "Hello World" work in the form of a SnappyData job, before experimenting with building a job combined…
Joseph Pride
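
A hedged sketch of what such a job skeleton tends to look like; the artifact coordinates in the sbt line and the exact package of the job traits are assumptions to verify against the release in use.

    // build.sbt (coordinates/version are an assumption; marked "provided" because
    // the cluster supplies these jars at runtime):
    //   libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.0.0" % "provided"

    import com.typesafe.config.Config
    import org.apache.spark.sql._

    // A "Hello World" style job, submitted with:
    //   snappy-job.sh submit --app-name hello --class HelloSnappyJob --app-jar <your-jar>
    class HelloSnappyJob extends SnappySQLJob {
      override def isValidJob(snappy: SnappySession, config: Config): SnappyJobValidation =
        SnappyJobValid()

      override def runSnappyJob(snappy: SnappySession, jobConfig: Config): Any = {
        snappy.sql("CREATE TABLE IF NOT EXISTS hello (id INT) USING column")
        snappy.sql("SELECT COUNT(*) FROM hello").collect()(0).getLong(0)
      }
    }
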
1 vote · 1 answer

Import complex data type(Array) from csv to snappydata

From my previous question I came to know how one can insert an array. Now I want to insert a large amount of data into the table. From this SnappyData reference I was able to import a large amount into the tables. But when I tried to import…
techie95
1 vote · 1 answer

How to store Array or Blob in SnappyData?

I'm trying to create a table with two columns like below: CREATE TABLE test (col1 INT, col2 Array) USING column options(BUCKETS '5'); It is created successfully, but when I'm trying to insert data into it, it is not accepting any format of…
techie95
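
A hedged sketch of one way this is commonly handled: declare the element type in the DDL and load the data through a DataFrame rather than SQL literals. The ARRAY<INT> syntax and the option names here are assumptions to check against the complex-data-types documentation for the version in use.

    // assumes an existing SnappySession `snappy`
    snappy.sql("DROP TABLE IF EXISTS test")
    snappy.sql("CREATE TABLE test (col1 INT, col2 ARRAY<INT>) USING column OPTIONS (BUCKETS '5')")

    // Insert through the DataFrame API, which already knows how to encode array columns.
    import snappy.implicits._
    val df = Seq((1, Seq(10, 20, 30)), (2, Seq(40, 50))).toDF("col1", "col2")
    df.write.insertInto("test")

    snappy.sql("SELECT col1, col2 FROM test").show()
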
1 vote · 1 answer

SnappyData Smart Connector - how to run jobs

I'm reading the documentation and I would like to ask you to help me understand the SnappyData Smart Connector. There are a few different examples in the documentation of how I should use spark-submit, e.g.: example 1 ./bin/spark-submit --deploy-mode…
Tomtom
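
As a rough sketch of Smart Connector mode: the application runs in its own Spark cluster and points at the SnappyData cluster through a connection property, then uses a SnappySession against the remote store. The property name, locator host, and port below are assumptions to verify against the Smart Connector documentation.

    import org.apache.spark.sql.{SnappySession, SparkSession}

    // Runs in your own Spark cluster (spark-submit or an IDE), not inside SnappyData.
    val spark = SparkSession.builder()
      .appName("smart-connector-sketch")
      // locator host and client port of the SnappyData cluster (assumed values)
      .config("spark.snappydata.connection", "snappy-locator:1527")
      .getOrCreate()

    val snappy = new SnappySession(spark.sparkContext)
    snappy.table("some_table").groupBy("col_a").count().show()  // hypothetical table/column
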
1 vote · 2 answers

SnappyData: What to Put in build.sbt and import Statement so I Can Use SnappySession

I'm working my way up to a "Hello World" kind of SnappyData application, which I would like to be able to build and run in IntelliJ. My cluster so far is one locator, one lead, and one server on the local machine. I just want to connect to it,…
Joseph Pride
1 vote · 2 answers

SnappyData + Zeppelin + Kafka streaming - error while creating streaming table

I'm trying to create a SnappyData streaming table using Zeppelin. I have an issue with the stream table definition's 'rowConverter' argument. The Zeppelin notebook is divided into a few paragraphs: Paragraph 1: import org.apache.spark.sql.Row import…
Tomtom
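
A minimal sketch of the class the 'rowConverter' option points to, assuming the StreamToRowsConverter interface used in the streaming examples; the package name, class name, and message format are assumptions.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.streaming.StreamToRowsConverter

    // Referenced from the stream table DDL as: rowConverter 'com.example.MyRowConverter'
    // (hypothetical class name). Each incoming message is turned into one or more Rows
    // matching the stream table's schema.
    class MyRowConverter extends StreamToRowsConverter with Serializable {
      override def toRows(message: Any): Seq[Row] = {
        val fields = message.toString.split(",")  // assume a simple CSV payload
        Seq(Row(fields(0), fields(1).trim.toInt))
      }
    }
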
1 vote · 1 answer

SnappyData submit a jar to cluster with parameters

The SnappyData documentation gives an example of how to submit a jar to a cluster: https://snappydatainc.github.io/snappydata/howto/run_spark_job_inside_cluster/ But what if I need to submit the jar with the same class CreatePartitionedRowTable multiple…
user3230153
1 vote · 1 answer

SnappyData : java.lang.OutOfMemoryError: GC overhead limit exceeded

I have 1.2 GB of ORC data on S3 and I am trying to do the following with it: 1) cache the data on the Snappy cluster [SnappyData 0.9], 2) execute a group-by query on the cached dataset, 3) compare the performance with Spark 2.0.0. I am using a 64 GB/8…
Harsh Bafna
1 vote · 1 answer

SnappyData multiple jobs to achieve parallelism

I am using SnappyData and SQL to run some analysis; however, the job is slow and involves join operations on very large input data. I am considering partitioning the input data first, then running the jobs on different partitions at the same time to speed…
user3230153
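
A sketch of the usual way to get partition-wise parallelism for such joins inside the cluster: partition both tables on the join key and colocate them. The option names are taken to be PARTITION_BY, COLOCATE_WITH and BUCKETS as in the DDL documentation; table and column names are hypothetical.

    // assumes an existing SnappySession `snappy`
    snappy.sql(
      """CREATE TABLE big_a (k INT, v DOUBLE) USING column
        |OPTIONS (PARTITION_BY 'k', BUCKETS '32')""".stripMargin)
    snappy.sql(
      """CREATE TABLE big_b (k INT, w DOUBLE) USING column
        |OPTIONS (PARTITION_BY 'k', COLOCATE_WITH 'big_a', BUCKETS '32')""".stripMargin)

    // The join key matches the partitioning column, so the join can run per bucket in parallel.
    snappy.sql("SELECT a.k, SUM(a.v * b.w) FROM big_a a JOIN big_b b ON a.k = b.k GROUP BY a.k").show()
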
1 vote · 2 answers

Snappydata SQL WITH statement

I am using SnappyData to run some queries and use the SQL WITH statement: WITH x AS ( SELECT DISTINCT col_a, col_b FROM table_a ) INSERT INTO table_b SELECT x.col_a, x.col_b FROM x JOIN table_c c ON x.col_a = c.col_a and x.col_b =…
user3230153
1 vote · 0 answers

How to connect to spark (CDH-5.8 docker vms at remote)? Do I need to map port 7077 at container?

Currently, I can access HDFS from inside my application, but instead of running my local Spark I'd also like to use Cloudera's Spark, as it is enabled in Cloudera Manager. Right now I have HDFS defined in core-site.xml, and I run my…
fhorta
1 vote · 1 answer

SnappyData as the Operational Database. Is it recommended?

I am testing databases for a new application where I will have to browse and index millions of XML files and subsequently generate analyses of the data. I would use SnappyData in this project. However, I do not know how it works. Is it…
1 vote · 2 answers

Can SnappyData database schema coexist with hive metadata store?

I have created a database schema with a few row-based tables in SnappyData 0.9 without a Hive metastore connected. Later, I add the hive.metastore.uris property to the hive-site.xml file and have SnappyData connect to it. To my surprise, the lead…
Caleb S