
I am new to Spark. I can load a single .json file in Spark, but what if there are thousands of .json files in a folder? [picture of the .json files in the folder]

I also have a CSV file that classifies the .json files with labels. [picture of the CSV file]

What should I do in Spark to load and save this data? For example, I want to load the first row of the CSV. It is text information, but it gives the path of a .json file; I want to load that .json and save the output, so that I know the JSON information for the first graph labeled "Trusted".

Fengyu

1 Answer


For the JSON files, point the reader at the whole folder and Spark will read every file in it (note that this returns a DataFrame, not an RDD):

json_df = sql_context.read.json("path/to/json_folder/")

For CSV, install the spark-csv package from here: Databricks' spark-csv

csv_df = sql_context.read.load("path/to/csv_folder/", format='com.databricks.spark.csv', header='true', inferSchema='true')
Alberto Bonsanto
Neel Tiwari
  • Thanks. Another question: how can I make the thousands of .json files load in parallel? Map & Reduce? – Fengyu Jun 20 '16 at 21:15
  • Also, note that from 2.0.0 onwards parsing CSV will be a part of Spark itself and you won't have to rely on spark-csv anymore. – BenFradet Jun 21 '16 at 07:51