I'm using SparkSQL in a Java application to do some processing on CSV files using Databricks for parsing.
The data I am processing comes from different sources (Remote URL, local file, Google Cloud Storage), and I'm in the habit of turning everything into an InputStream so that I can parse and process data without knowing where it came from.
All the documentation I've seen on Spark reads files from a path, e.g.
SparkConf conf = new SparkConf().setAppName("spark-sandbox").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlc = new SQLContext(sc);
DataFrame df = sqlc.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.load("path/to/file.csv");
DataFrame dfGrouped = df.groupBy("varA","varB")
.avg("varC","varD");
dfGrouped.show();
And what I'd like to do is read from an InputStream, or even just an already-in-memory string. Something like the following:
InputStream stream = new URL(
"http://www.sample-videos.com/csv/Sample-Spreadsheet-100-rows.csv"
).openStream();
DataFrame dfRemote = sqlc.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.load(stream);
String someString = "imagine,some,csv,data,here";
DataFrame dfFromString = sqlc.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.read(someString);
Is there something simple I'm missing here?
I've read a bit of the docs on Spark Streaming and custom receivers, but as far as I can tell, this is for opening a connection that will be providing data continuously. Spark Streaming seems to break the data into chunks and do some processing on it, expecting more data to come in an unending stream.
My best guess here is that Spark as a descendant of Hadoop, expects large amounts of data that probably resides in a filesystem somewhere. But since Spark does its processing in-memory anyway, it made sense to me for SparkSQL to be able to parse data already in memory.
Any help would be appreciated.