I am trying to generate an XML file with the following structure:
<n:Brands>
<n:Brand>
<Name>234</Name>
<Test>34</Test>
</n:Brand>
<n:Brand>
<Name>234</Name>
<Test>34</Test>
</n:Brand>
</n:Brands>
I have the following code:
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf();
    sparkConf.setAppName("Unit Test");
    sparkConf.setMaster("local[2]");
    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
    SQLContext sqlContext = new SQLContext(javaSparkContext);
    // Two identical sample records, spread across the two local partitions
    final JavaRDD<Book> parallelize = javaSparkContext
            .parallelize(Arrays.asList(Book.builder().name("234").test("34").build(),
                    Book.builder().name("234").test("34").build()));
    // Convert each Book into a Row matching the schema below
    final JavaRDD<Row> map = parallelize.map(book -> RowFactory.create(
            book.getName(),
            book.getTest()
    ));
    final Dataset<Row> dataFrame = sqlContext.createDataFrame(map, new StructType(new StructField[]{
            new StructField("Name", DataTypes.StringType, true, Metadata.empty()),
            new StructField("Test", DataTypes.StringType, true, Metadata.empty())
    }));
    // Write as XML; each partition becomes its own part file
    dataFrame
            .write()
            .format("com.databricks.spark.xml")
            .mode(SaveMode.Overwrite)
            .option("rootTag", "n:Brands")
            .option("rowTag", "n:Brand")
            .save("out/path");
}
When I run this, it creates two files, part-00000 and part-00001, in the specified directory. Each file contains its own root tag and row tag, so when I copyMerge the part files, the root tag (n:Brands) is duplicated.
Each part file looks like this:
<n:Brands>
<n:Brand>
<Name>234</Name>
<Test>34</Test>
</n:Brand>
</n:Brands>
I use FileUtil.copyMerge to merge the part files:
FileUtil.copyMerge(hdfs, new org.apache.hadoop.fs.Path(processLocation), hdfs,
new org.apache.hadoop.fs.Path(preparedLocation), false,
getFSConfiguration(), null);
After merging the two part files, the result is:
<n:Brands>
<n:Brand>
<Name>234</Name>
<Test>34</Test>
</n:Brand>
</n:Brands>
<n:Brands>
<n:Brand>
<Name>234</Name>
<Test>34</Test>
</n:Brand>
</n:Brands>
How can I avoid the root tag being duplicated in each part file?
I don't want to use repartition(1), because the dataset is huge and a single worker will not be able to handle it.
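One direction I can think of is replacing copyMerge with a custom merge that drops the root tags from each part file and writes them once around the combined rows. Below is a minimal sketch of that idea, not a tested solution for my real setup. It assumes plain local files read through java.nio (not HDFS), that the opening and closing root tags each sit on their own line as in the part files shown above, and that no XML declaration is written. The names RootTagMerge and merge are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: merges part files while emitting the root element only once.
final class RootTagMerge {

    static void merge(Path partsDir, Path target, String rootTag) throws IOException {
        String open = "<" + rootTag + ">";
        String close = "</" + rootTag + ">";

        // Collect part-* files in name order; _SUCCESS and .crc files are skipped.
        List<Path> parts;
        try (Stream<Path> files = Files.list(partsDir)) {
            parts = files
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
        }

        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(open);
            out.write("\n");
            for (Path part : parts) {
                // Stream line by line so a large part file is never held in memory.
                try (Stream<String> lines = Files.lines(part, StandardCharsets.UTF_8)) {
                    for (String line : (Iterable<String>) lines::iterator) {
                        String trimmed = line.trim();
                        // Assumption: the root tags appear alone on their own lines.
                        if (trimmed.equals(open) || trimmed.equals(close)) {
                            continue;
                        }
                        out.write(line);
                        out.write("\n");
                    }
                }
            }
            out.write(close);
            out.write("\n");
        }
    }
}
```

Calling RootTagMerge.merge(partsDir, target, "n:Brands") on the two part files above would produce a single n:Brands element containing both n:Brand rows. The merge streams the files sequentially, so it avoids repartition(1), but it still runs on one machine; an HDFS version would need the Hadoop FileSystem API instead of java.nio.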