
SnappyData v0.5

Goal: I want to create a persistent, replicated ROAD table and load it from a CSV file using the Snappy Shell. The ROAD table should have 'road_id' as a primary key to prevent duplicate IDs.

The commands I tried are:

SET SCHEMA A;

DROP TABLE IF EXISTS ROAD;
DROP TABLE IF EXISTS STAGING_ROAD;

CREATE TABLE STAGING_ROAD 
(road_id string, name string)
USING com.databricks.spark.csv
OPTIONS(path 'roads.csv', header 'true');

CREATE TABLE ROAD
(
    road_id VARCHAR(64) NOT NULL,
    name VARCHAR(64) NOT NULL,
    CONSTRAINT road_PK PRIMARY KEY (road_id)
) USING row OPTIONS (BUCKETS '5', REPLICATE, PERSISTENT)
AS (select road_id, name from STAGING_ROAD);

This does not work, and I have to dumb my ROAD create down to the following just to get it created. It has no primary key, no replication, and no persistence.

CREATE TABLE ROAD USING row OPTIONS ()
AS (select road_id, name from STAGING_ROAD);

How should I be scripting a SnappyData SQL file to accomplish the goal stated above?

Jason

2 Answers


You can achieve the same using the following script:

CREATE TABLE STAGING_ROAD USING com.databricks.spark.csv OPTIONS(path 'roads.csv', header 'true',inferSchema 'true');

CREATE TABLE STAGING_ROAD2 USING row AS (SELECT road_id,name FROM STAGING_ROAD);

CREATE TABLE ROAD
(
    road_id VARCHAR(64) NOT NULL PRIMARY KEY,
    name VARCHAR(64) NOT NULL
) USING row OPTIONS (PARTITION_BY 'road_id', BUCKETS '5', PERSISTENT 'ASYNCHRONOUS');

INSERT INTO ROAD SELECT road_id, name from STAGING_ROAD2;

DROP TABLE STAGING_ROAD2;

The creation of STAGING_ROAD2 works around a bug; it will not be required on the latest build.

Programmatically, you can achieve it in a simpler way, like the following:

String roadCsvFile = "road.csv";

snc.sql("CREATE TABLE ROAD( road_id VARCHAR(64) NOT NULL PRIMARY KEY,  name VARCHAR(64) NOT NULL) USING row OPTIONS (PARTITION_BY 'road_id', BUCKETS '5', PERSISTENT 'ASYNCHRONOUS')");

DataFrame roadDf = snc.read()
         .format("com.databricks.spark.csv") // CSV to DF package
         .option("header", "true") // Use first line of all files as header
         .option("inferSchema", "true") // Automatically infer data types
         .load(roadCsvFile);

// Populate the table in snappy store
roadDf.write().insertInto("ROAD");
suranjan
  • I'm not clear on what SRC_SYS and TRD_DATE do. I also don't understand how all the rows from the CSV actually get into the final table. – Jason Jul 26 '16 at 14:39
  • Rows from CSV is loaded to STAGING_ROAD and then to STAGING_ROAD2. Finally, it is inserted to the final table with all the constraints etc. – suranjan Jul 26 '16 at 15:15

A couple of points here. Since you want to put some constraints on the table, you need to create it first; then you can insert data into it.

a) CREATE TABLE ROAD (
    road_id VARCHAR(64) NOT NULL,
    name VARCHAR(64) NOT NULL,
    CONSTRAINT road_PK PRIMARY KEY (road_id)
) USING row OPTIONS (PERSISTENT '');

You don't have to specify the REPLICATE keyword (not needed; replicated is the default for row tables) or the BUCKETS parameter (which applies only to partitioned tables).

b) Ideally it should have been INSERT INTO ROAD (SELECT * FROM STAGING_ROAD). Unfortunately, the 0.5 release (which I assume you are using) does not have SQL support for bulk inserts from external data sources like CSV and Parquet. This has been taken care of in the latest master and will be available in subsequent releases.

c) The workaround is to write a Spark job and insert into the created row table. You can refer to the jobs section of the documentation: http://snappydatainc.github.io/snappydata/jobs/
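Building on c), here is a minimal sketch of such a job body. It assumes (as in the other answer) a SnappyContext named `snc` handed to the job by the SnappyData job runtime, the ROAD row table already created per a), and the `roads.csv` file name from the question:

```java
import org.apache.spark.sql.DataFrame;

// Sketch only: `snc` is assumed to be the SnappyContext provided to the job
// by the SnappyData job runtime (see the jobs documentation linked above).
DataFrame roadDf = snc.read()
        .format("com.databricks.spark.csv")  // spark-csv package
        .option("header", "true")            // first line of the file is the header
        .load("roads.csv");

// Insert into the pre-created replicated row table; the PRIMARY KEY on
// road_id rejects duplicate IDs instead of silently loading them.
roadDf.select("road_id", "name").write().insertInto("ROAD");
```

Because the table is created separately with its constraints, the insert enforces the primary key, which the CREATE TABLE ... AS SELECT approach in the question could not do.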

Let me know if you need some other info.