Spark: Read an inputStream instead of File

Question

I'm using SparkSQL in a Java application to do some processing on CSV files using Databricks for parsing.

The data I am processing comes from different sources (Remote URL, local file, Google Cloud Storage), and I'm in the habit of turning everything into an InputStream so that I can parse and process data without knowing where it came from.

All the documentation I've seen on Spark reads files from a path, e.g.

SparkConf conf = new SparkConf().setAppName("spark-sandbox").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlc = new SQLContext(sc);

DataFrame df = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("path/to/file.csv");

DataFrame dfGrouped = df.groupBy("varA","varB")
    .avg("varC","varD");

dfGrouped.show();

And what I'd like to do is read from an InputStream, or even just an already-in-memory string. Something like the following:

InputStream stream = new URL(
    "http://www.sample-videos.com/csv/Sample-Spreadsheet-100-rows.csv"
    ).openStream();

DataFrame dfRemote = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(stream);

String someString = "imagine,some,csv,data,here";

DataFrame dfFromString = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .read(someString);

Is there something simple I'm missing here?

I've read a bit of the docs on Spark Streaming and custom receivers, but as far as I can tell, this is for opening a connection that will be providing data continuously. Spark Streaming seems to break the data into chunks and do some processing on it, expecting more data to come in an unending stream.

My best guess here is that Spark as a descendant of Hadoop, expects large amounts of data that probably resides in a filesystem somewhere. But since Spark does its processing in-memory anyway, it made sense to me for SparkSQL to be able to parse data already in memory.

Any help would be appreciated.

score 4 · Accepted Answer · answered Jul 25 '16 at 20:08

4

You can use at least four different approaches to make your life easier:

Use your input stream, write to a local file (fast with SSD), read with Spark.
Use Hadoop file system connectors for S3, Google Cloud Storage and turn everything into a file operation. (That won't solve the issue with reading from an arbitrary URL as there is no HDFS connector for this.)
Represent different input types as different URIs and create a utility function that inspects the URI and triggers the appropriate read operation.
Same as (3) but use case classes instead of a URI and simply overload based on the input type.

answered Jul 25 '16 at 20:08

Sim

13,147
9
66
95

1

I'm actually going for a 5th option here, which is to serialize the return into a JavaRDD, and then convert the RDD to a DataFrame. Thanks for your help. – Nate Vaughan Aug 01 '16 at 21:29
do you mind helping me to understand how you converted string to RDD then Dataframe? in my case, the inputsteam is a csv file with header(can convert it into a string). do I need to provide schema? or I can say .option("header", "true") so it will automatically detect the schema? – Ponns Nov 01 '18 at 15:05

Spark: Read an inputStream instead of File

1 Answers1

Linked