How do I load gziped json data into table, using Spark SQL's CREATE TABLE Query

Question

I want to connect Apache Superset with Apache Spark (I have Spark 3.1.2) and Query the data on Superset's SQL Lab using Apache Spark SQL.

On spark's master, I started thrift server using this command spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.

Then I added Spark cluster as a database in Superset using SQLAlchemy URI hive://hive@spark:10000/. I am able to access Spark cluster on Superset.

I can load JSON data as table using this SQL

CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json"

and I am able to Query data using simple SQL statements like SELECT * FROM test_table LIMIT 10

BUT the problem is that json data is compressed as gzipped files.

So I tried

CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json.gz"

but it did not work. I want to know how do load gzipped json data into a table

Data at rest is Newline separated json compressed as gzipped. — Rahul Prasad, Oct 10 '21 at 17:05

Shwetha · Answer 1 · 2021-10-19T09:33:43.563

Compressed JSON storage

If you have large JSON text you can explicitly compress JSON text using built-in COMPRESS function. In the following example compressed JSON content is stored as binary data, and we have computed column that decompress JSON as original text using DECOMPRESS function:

CREATE TABLE Person

( _id int identity constraint PK_JSON_ID primary key,

data varbinary(max),

value AS CAST(DECOMPRESS(data) AS nvarchar(max))

)



INSERT INTO Person(data)

VALUES (COMPRESS(@json))

COMPRESS and DECOMPRESS functions use standard GZip compression.

Another example:

CREATE EXTENSION json_fdw;
postgres=# CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;

postgres=# CREATE FOREIGN TABLE customer_reviews
(
    customer_id TEXT,
    "review.date" DATE,
    "review.rating" INTEGER,
    "product.id" CHAR(10),
    "product.group" TEXT,
    "product.title" TEXT,
    "product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');

Note: This example was taken from https://www.citusdata.com/blog/2013/05/30/run-sql-on-json-files-without-any-data-loads

How do I load gziped json data into table, using Spark SQL's CREATE TABLE Query

1 Answers1