
I deleted two of my earlier questions because I thought they were too big and I could not explain them neatly.

So I am trying to keep it simple this time.

I have a complex nested XML file. I am parsing it in Spark with Scala, and I have to save all the data from the XML into a text file.

NOTE: I need to save the data as text files because later I have to join it with another file that is in text format. Also, can I join a CSV file with a JSON or Parquet file? If yes, I may not need to convert my XML into a text file.

This is my code, where I am trying to save the XML as a CSV file, but since CSV does not support array-type columns I am getting an error.

I am looking for a solution that lets me extract all elements of an array and save them to a text file.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.explode

def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("XML").setMaster("local")
    val sc = new SparkContext(conf) // Create the Spark context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Read the XML with spark-xml, one row per env:Body element
    val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:Body").load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")
    // One row per env:ContentItem entry, with its fields promoted to top-level columns
    val resDf = df.withColumn("FlatType", explode(df("env:ContentItem"))).select("FlatType.*")

    resDf.repartition(1).write
      .format("csv") // CSV does not support array-type columns, so this write fails
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("nullValue", "")
      .option("delimiter", "\t")
      .option("quote", "\u0000")
      .option("header", "true")
      .save("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML//output")

    // val resDf = df.withColumn("FlatType", when(df("env:ContentItem").isNotNull, explode(df("env:ContentItem"))))
  }

This produces the output below before saving:

+---------+--------------------+
|  _action|            env:Data|
+---------+--------------------+
|   Insert|[fun:FundamentalD...|
|Overwrite|[sr:FinancialSour...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
+---------+--------------------+

For each unique env:Data I expect a separate file. That can be done using partitioning, but how can I save it as a text file?

I have to save all the elements from the array, that is, all columns.

I hope my question is clear this time.

If required, I can also add the schema.

Sudarshan kumar

1 Answer


Spark SQL has a direct write to csv option. Why not use that?

Here is the syntax:

resDf.write.option("header", "true").csv("output file path")

This should save your file directly to csv format.
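The CSV writer does reject array, struct and map columns, so those have to be turned into plain strings before the write. Below is a minimal sketch of one way to do that, not a definitive implementation: it assumes a reasonably recent Spark (roughly 2.4 or later, where `to_json` accepts arrays, structs and maps). The object name, the helper `stringifyComplexColumns`, the demo data and the `demo-output` path are all made up for illustration; in the question's code the helper would be applied to `resDf`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_json}
import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

object StringifyComplexColumns {

  // Replace every array, struct or map column with its JSON string form and
  // leave simple columns untouched, so the result has a CSV-compatible schema.
  def stringifyComplexColumns(df: DataFrame): DataFrame = {
    val projected = df.schema.fields.map { field =>
      // Backticks keep names such as "env:Data" from being parsed as expressions.
      val c = col(s"`${field.name}`")
      field.dataType match {
        case _: ArrayType | _: StructType | _: MapType => to_json(c).as(field.name)
        case _                                         => c
      }
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StringifyDemo")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-in for resDf: one plain column and one array-of-struct column.
    val demo = Seq(
      ("Insert", Seq(("fun", 1))),
      ("Overwrite", Seq(("fl", 2), ("pe", 3)))
    ).toDF("_action", "env:Data")

    val flat = stringifyComplexColumns(demo)
    flat.printSchema() // every column is now a string

    flat.write
      .mode("overwrite")
      .option("header", "true")
      .option("delimiter", "\t")
      .csv("demo-output") // hypothetical output path

    spark.stop()
  }
}
```

This keeps every element of the array in the text file, at the cost of one JSON-encoded cell per row. If each array element should become its own row instead, `explode` followed by `select("col.*")`, as already used in the question, is the alternative.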

  • csv does not support data type array – Sudarshan kumar Feb 06 '18 at 06:18
  • you are writing a dataframe to csv file right? That's what I understood from last part of your code – Shrinivas Deshmukh Feb 06 '18 at 06:49
  • Yes, but we cannot do that because CSV does not allow array types. So my question is: how can we convert this kind of XML into text or CSV and then write it to a text file? – Sudarshan kumar Feb 06 '18 at 06:55
  • I'm a bit confused here. You have created a dataframe 'df', then you applied some transformations and created a new dataframe resDf. And in the last part, you are writing resDf dataframe to csv. Right? – Shrinivas Deshmukh Feb 06 '18 at 07:31
  • Where exactly is it not working? While converting to a dataframe or while writing to csv? Also, for joining, I suggest you load both files as dataframes, create views over those dataframes using registerTempTable, and then use a sql join query directly. – Shrinivas Deshmukh Feb 06 '18 at 07:48
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164593/discussion-between-shrinivas-deshmukh-and-sudarshan). – Shrinivas Deshmukh Feb 06 '18 at 07:49
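
The join suggested in the comments can be sketched as follows. Once both inputs are loaded as DataFrames, their on-disk formats (CSV, JSON, Parquet) no longer matter for the join. This is only a sketch under assumptions: it uses Spark 2.x or later, where `createOrReplaceTempView` replaces the deprecated `registerTempTable`. The object name, the helper `joinOnKey`, the view names, the `id` key column and the demo data are all hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinAcrossFormats {

  // Join two DataFrames through temporary views with a plain SQL query.
  // `key` is a hypothetical join column that must exist in both inputs.
  def joinOnKey(spark: SparkSession, left: DataFrame, right: DataFrame, key: String): DataFrame = {
    left.createOrReplaceTempView("left_side")
    right.createOrReplaceTempView("right_side")
    // USING keeps a single copy of the key column in the result.
    spark.sql(s"SELECT * FROM left_side JOIN right_side USING (`$key`)")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinDemo")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Write made-up demo data in two different formats to a temporary directory.
    val dir = Files.createTempDirectory("join-demo").toString
    Seq((1, "Insert"), (2, "Overwrite")).toDF("id", "action")
      .write.option("header", "true").csv(s"$dir/actions_csv")
    Seq((1, "fl"), (2, "pe")).toDF("id", "kind")
      .write.parquet(s"$dir/kinds_parquet")

    // Load each file with the reader for its own format.
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true") // so that id is read as a number, not a string
      .csv(s"$dir/actions_csv")
    val parquetDf = spark.read.parquet(s"$dir/kinds_parquet")

    joinOnKey(spark, csvDf, parquetDf, "id").show()

    spark.stop()
  }
}
```

The same join can be written without SQL as `csvDf.join(parquetDf, Seq("id"))`. The temporary views are only needed when the query is expressed as a SQL string.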