1

I'm working on a system that will act as an OLAP engine for a dataset produced by a simulation toolchain. The tools generate their results in XML.

The simplest solution, to me, would have been to use spark-xml to access the XML files directly from Python, Scala, etc. The problem is that the project owners want to use C#, because that is what the original simulation toolchain is built in. I know there is SparkCLR for C#, but I don't know of a good way of using spark-xml from C#.

Does anyone have suggestions on how to do this? If not, I guess the next option would be to translate the datasets into something more native to SparkCLR, but I'm not sure of the best approach.

zero323
Kevin Vasko

2 Answers

2

SparkCLR works with spark-xml. The following C# code processes XML as a Spark DataFrame and can serve as a starting point for an XML-processing C# application on Spark. It implements the same example as https://github.com/databricks/spark-xml#scala-api. Note that you need to include the spark-xml jar when submitting your job.

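        using Microsoft.Spark.CSharp.Core;  // SparkConf, SparkContext
        using Microsoft.Spark.CSharp.Sql;   // SqlContext

        var sparkConf = new SparkConf();
        var sparkContext = new SparkContext(sparkConf);
        var sqlContext = new SqlContext(sparkContext);

        // Read books.xml, treating each <book> element as one row.
        var df = sqlContext.Read()
            .Format("com.databricks.spark.xml")
            .Option("rowTag", "book")
            .Load(@"D:\temp\spark-xml\books.xml");

        // "@id" is the id attribute of <book>; the attribute prefix may differ by spark-xml version.
        var selectedData = df.Select("author", "@id");

        // Write the selection back out as XML: <books> is the root, one <book> per row.
        selectedData.Write()
            .Format("com.databricks.spark.xml")
            .Option("rootTag", "books")
            .Option("rowTag", "book")
            .Save(@"D:\temp\spark-xml\newbooks.xml");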
skaarthik
0

I'm not aware of a good analog to Spark in the .NET world. P-LINQ may be the closest, but it's not distributed. Microsoft Azure offers Hadoop, R, etc. which you can use for distributed map-reduce type functionality. Hopefully the project owners understand you're facing much more effort to complete the work in C#.

J Burnett
  • I found https://github.com/Microsoft/SparkCLR, which lets me write C# code to interact with Spark. But I want to work with XML data on Spark through spark-xml, and getting the two to work together would be the challenge. The only thing I can think of is a stopgap: using something (e.g. Apache NiFi, Flume, etc.) to take the XML data and store it in some other form that would be easier to work with in C#/SparkCLR. – Kevin Vasko Jan 19 '16 at 02:44
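
For what it's worth, the stopgap described in this comment (storing the XML in a form that is easier to consume) could be sketched in plain C# without NiFi or Flume. The snippet below is only an illustration under stated assumptions: `XmlToJsonLines` and `Convert` are hypothetical names, not part of SparkCLR or spark-xml; it assumes flat records like the `book` rows in the spark-xml sample, and it uses `System.Text.Json`, which needs a recent .NET runtime and so would run as a separate preprocessing step rather than inside the Spark job. Each row element becomes one JSON object per line, which Spark's built-in JSON data source can read without the extra jar.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Xml.Linq;

// Hypothetical helper (not part of SparkCLR or spark-xml): turns each
// <rowTag> element into one JSON object per line ("JSON Lines").
public static class XmlToJsonLines
{
    public static IEnumerable<string> Convert(
        XDocument doc, string rowTag, string attributePrefix = "@")
    {
        // Note: a plain string tag name only matches elements without an XML namespace.
        foreach (var row in doc.Descendants(rowTag))
        {
            var record = new Dictionary<string, string>();

            // Attributes become prefixed fields, e.g. id="bk101" -> "@id".
            foreach (var attr in row.Attributes().Where(a => !a.IsNamespaceDeclaration))
                record[attributePrefix + attr.Name.LocalName] = attr.Value;

            // Direct child elements become plain string fields. Nested elements
            // are reduced to their text, and a repeated child name keeps only
            // the last value (an assumption that fits flat records only).
            foreach (var child in row.Elements())
                record[child.Name.LocalName] = child.Value;

            yield return JsonSerializer.Serialize(record);
        }
    }
}
```

Something like `File.WriteAllLines(outputPath, XmlToJsonLines.Convert(XDocument.Load(inputPath), "book"))` would then produce a file to load as JSON. `XDocument.Load` reads the whole document into memory, so very large result files would need a streaming reader instead.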