
I am quite new to Spark and started with PySpark. I am learning to push data from Kafka to Hive using PySpark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import *
from pyspark.streaming.kafka import KafkaUtils
from os.path import abspath

warehouseLocation = abspath("spark-warehouse")

spark = SparkSession.builder.appName("sparkstreaming").getOrCreate()

df = spark.read.format("kafka") \
    .option("startingoffsets", "earliest") \
    .option("kafka.bootstrap.servers", "kafka-server1:66,kafka-server2:66") \
    .option("kafka.security.protocol", "SSL") \
    .option("kafka.ssl.keystore.location", "mykeystore.jks") \
    .option("kafka.ssl.keystore.password", "mykeystorepassword") \
    .option("subscribe", "json_stream") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

json_schema = df.schema

df1 = df.select($"value").select(from_json,json_schema).alias("data").select("data.*")

The above is not working. Once the data is extracted, I want to insert it into a Hive table.

As I am completely new to this, I am looking for help. Thanks in advance! :)

Jay

1 Answer

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# spark is an existing SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
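
The example above loads a local text file. For the Kafka case in the question, one possible approach is to parse the JSON string with an explicit schema and append the result to a Hive table. The following is only a sketch under assumptions: the schema string is guessed from the sample record in the comments, the helper names (kafka_options, from_json_expr, kafka_json_to_hive) are made up for illustration, the session is built with enableHiveSupport() as shown above, and the spark-sql-kafka-0-10 package is on the classpath. Note that df.schema in the question only describes the single string column "value", so it cannot serve as the JSON schema.

```python
# Assumed schema, guessed from the sample record {"foo": ..., "bar": ..., "table_name": ...}.
SCHEMA_DDL = "foo STRING, bar STRING, table_name STRING"


def kafka_options(servers, topic, keystore_path, keystore_password):
    """Collect the Kafka source options from the question into one dict."""
    return {
        "kafka.bootstrap.servers": servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
        "kafka.security.protocol": "SSL",
        "kafka.ssl.keystore.location": keystore_path,
        "kafka.ssl.keystore.password": keystore_password,
    }


def from_json_expr(schema_ddl, column="value", alias="data"):
    """Build a Spark SQL expression that parses a JSON string column."""
    return f"from_json({column}, '{schema_ddl}') AS {alias}"


def kafka_json_to_hive(spark, options, schema_ddl, table):
    """Batch-read a Kafka topic, parse the JSON value, append to a Hive table."""
    raw = spark.read.format("kafka").options(**options).load()
    parsed = (
        raw.selectExpr("CAST(value AS STRING) AS value")  # Kafka values arrive as bytes
        .selectExpr(from_json_expr(schema_ddl))           # struct column named "data"
        .select("data.*")                                 # one column per JSON field
    )
    # Assumes the target table is new or already stored in Hive format.
    parsed.write.format("hive").mode("append").saveAsTable(table)
    return parsed


# Usage, with the values from the question:
# opts = kafka_options("kafka-server1:66,kafka-server2:66", "json_stream",
#                      "mykeystore.jks", "mykeystorepassword")
# kafka_json_to_hive(spark, opts, SCHEMA_DDL, "mytable")
```

Fields missing from a record come back as null, so the schema string should list every field expected in the table.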
Jay Kakadiya
  • The value in my Kafka messages is a stream of JSON, say {"foo":"foo", "bar":"bar","table_name":"mytable"} {"barfoo":"barfoo","foobar":"foobar","table_name":"mytable"}. Once I do df.selectExpr("CAST (value AS STRING)"), how can I insert the data into the Hive table "mytable"? Can you please help? – Jay Feb 16 '20 at 23:06
  • I edited the code. Please review and try it; if you are still facing an issue, please let me know. – Jay Kakadiya Feb 17 '20 at 02:17
  • @Jay Kakadiya Do you mean saving the data from Kafka to an HDFS location and then loading it into a staging table to insert into the real tables? Please let me know if my understanding is not correct. – Jay Feb 20 '20 at 09:26
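
Since each sample record carries its own "table_name", one way to route records is to group the JSON strings by that field and append each group to the table it names. This is a rough sketch, not a definitive solution: it assumes every Kafka message holds exactly one JSON object, that the batch is small enough to collect on the driver, and that the session has Hive support enabled. The function names group_by_table and write_groups are hypothetical.

```python
import json
from collections import defaultdict


def group_by_table(json_values, key="table_name"):
    """Group raw JSON strings by the table named in each record."""
    grouped = defaultdict(list)
    for value in json_values:
        record = json.loads(value)
        grouped[record[key]].append(value)
    return dict(grouped)


def write_groups(spark, grouped):
    """Append each group of JSON strings to the Hive table it names."""
    for table, values in grouped.items():
        # Let Spark infer one schema per group; fields absent in a record become null.
        df = spark.read.json(spark.sparkContext.parallelize(values))
        # Assumes the target table is new or already stored in Hive format.
        df.write.format("hive").mode("append").saveAsTable(table)


# Usage, where df is the single-column DataFrame from the question:
# values = [row.value for row in df.collect()]
# write_groups(spark, group_by_table(values))
```

For large topics, collecting to the driver will not scale; filtering the parsed DataFrame per distinct table_name inside Spark would be the more usual route.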