
Environment: Scala and Spark 1.6

I am trying to:
1. Get JSON data through a REST API call
2. Write it to HDFS as a JSON file
3. Convert the JSON file into a DataFrame

val rawdata = "curl http://services.groupkt.com/state/get/USA/all"!!
println(rawdata)  // can see json output, but can't save as file in HDFS

I can see the output on screen, but how can I write the contents of rawdata to an HDFS URL (hdfs://quickstart.cloudera:8020/user/hive/warehouse/test/)? Or is there a way to parse the contents of rawdata without saving it as a file? I also need to convert the JSON to a DataFrame.

Thanks in advance
Hossain

Jhon

1 Answer

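import scala.sys.process._  // provides the !! operator on strings

// Run curl and capture its standard output as a String
val rawdata = "curl http://services.groupkt.com/state/get/USA/all".!!
println(rawdata)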

Once you have the data, you can use the code from this answer to save it in Hadoop.
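
Since the linked answer is not reproduced here, one possible way to write the string is through the Hadoop FileSystem API. This is only a minimal sketch and not necessarily what that answer does: the helper name writeStringToHdfs and the file name states.json are hypothetical, and it assumes the Hadoop client classes are on the classpath (as they are inside spark-shell).

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: writes `content` to `target`, overwriting an existing file.
def writeStringToHdfs(content: String,
                      target: String,
                      conf: Configuration = new Configuration()): Unit = {
  val path = new Path(target)
  // The file system is resolved from the URI scheme (hdfs://, file://, ...)
  val fs = path.getFileSystem(conf)
  val out = fs.create(path, true) // true = overwrite
  try out.write(content.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}
```

In spark-shell this could be called as writeStringToHdfs(rawdata, "hdfs://quickstart.cloudera:8020/user/hive/warehouse/test/states.json", sc.hadoopConfiguration), so that the cluster settings Spark already loaded are reused.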

Creating Dataframe:

Suppose your json string is something like this:

{"time":"sometext1","host":"somehost1","event":  {"category":"sometext2","computerName":"somecomputer1"}}

You can convert the JSON into a DataFrame with the following code:

// Create an RDD with one JSON string per element
val vals = sc.parallelize(
  """{"time":"sometext1","host":"somehost1","event":  {"category":"sometext2","computerName":"somecomputer1"}}""" ::
    Nil)

// Define the schema explicitly, including the nested "event" struct
import org.apache.spark.sql.types.{StringType, StructType}
val schema = (new StructType)
  .add("time", StringType)
  .add("host", StringType)
  .add("event", (new StructType)
    .add("category", StringType)
    .add("computerName", StringType))

import sqlContext.implicits._
// Apply the schema while reading the RDD of JSON strings
val jsonDF = sqlContext.read.schema(schema).json(vals)
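
The same idea should also cover the "parse rawdata without saving it as a file" part of the question. The sketch below is an assumption-laden example, not a tested recipe for that service: jsonStringToDF is a hypothetical helper, the schema is inferred rather than declared, and the structure of the real response is not known here, so nested fields may need further selection afterwards.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: treats the whole string as a single JSON record.
def jsonStringToDF(sc: SparkContext, sqlContext: SQLContext, json: String): DataFrame = {
  val rdd = sc.parallelize(Seq(json)) // one RDD element per JSON document
  sqlContext.read.json(rdd)           // no schema given, so Spark infers it
}
```

In spark-shell, where sc and sqlContext already exist, usage would be val df = jsonStringToDF(sc, sqlContext, rawdata) followed by df.printSchema() to inspect what was inferred.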

After creating the DataFrame, you still have the option of saving it in Hadoop, either with the spark-csv library or by calling saveAsTextFile on an RDD.
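
As a rough illustration of the saving step, the sketch below keeps the data as JSON instead of CSV, because CSV has no nested columns and the "event" struct would have to be flattened first. The helper name saveAsJson is hypothetical and the output directory is whatever you choose; Spark writes a directory of part files, not a single file, and by default fails if the directory already exists.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: writes the DataFrame as JSON text, one record per line.
// In Spark 1.6 the RDD route mentioned above would be:
//   df.toJSON.saveAsTextFile(dir)   // toJSON returns RDD[String] in 1.6
def saveAsJson(df: DataFrame, dir: String): Unit =
  df.write.json(dir)
```

For example, saveAsJson(jsonDF, "hdfs://quickstart.cloudera:8020/user/hive/warehouse/test/out"), where the out directory name is only a placeholder.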

bob