I am trying to generate an XML file with the following structure:

<n:Brands>
    <n:Brand>
        <Name>234</Name>
        <Test>34</Test>
    </n:Brand>
    <n:Brand>
        <Name>234</Name>
        <Test>34</Test>
    </n:Brand>
</n:Brands>

I have the following code:

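import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf();
    sparkConf.setAppName("Unit Test");
    sparkConf.setMaster("local[2]");
    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
    SQLContext sqlContext = new SQLContext(javaSparkContext);

    // Book is my own POJO with a builder and getName()/getTest() accessors.
    final JavaRDD<Book> parallelize = javaSparkContext
        .parallelize(Arrays.asList(Book.builder().name("234").test("34").build(),
            Book.builder().name("234").test("34").build()));

    // Convert each Book into a Row matching the schema below.
    final JavaRDD<Row> map = parallelize.map(book -> RowFactory.create(
        book.getName(),
        book.getTest()
    ));

    final Dataset<Row> dataFrame = sqlContext.createDataFrame(map, new StructType(new StructField[]{
        new StructField("Name", DataTypes.StringType, true, Metadata.empty()),
        new StructField("Test", DataTypes.StringType, true, Metadata.empty())
    }));

    // Write one XML part file per partition, each wrapped in the root tag.
    dataFrame
        .write()
        .format("com.databricks.spark.xml")
        .mode(SaveMode.Overwrite)
        .option("rootTag", "n:Brands")
        .option("rowTag", "n:Brand")
        .save("out/path");
}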

When I run this, it creates two files, part-00000 and part-00001, in the specified directory. Each file contains its own root tag as well as the row tags, so when I merge the part files with copyMerge, the root tag (n:Brands) is duplicated.

Each part file looks like this:

<n:Brands>
    <n:Brand>
        <Name>234</Name>
        <Test>34</Test>
    </n:Brand>
</n:Brands>

I use FileUtil.copyMerge to merge the part files:

FileUtil.copyMerge(hdfs, new org.apache.hadoop.fs.Path(processLocation), hdfs,
          new org.apache.hadoop.fs.Path(preparedLocation), false,
          getFSConfiguration(), null);

After merging the two part files, the result is:

<n:Brands>
    <n:Brand>
        <Name>234</Name>
        <Test>34</Test>
    </n:Brand>
</n:Brands>
<n:Brands>
    <n:Brand>
        <Name>234</Name>
        <Test>34</Test>
    </n:Brand>
</n:Brands>

How can I avoid the root tag being duplicated in each part file?

I don't want to use repartition(1) because I have a huge dataset and a single worker would not be able to handle it.

  • This is intended behavior. If you don't find it acceptable don't use the library (generate XML docs for each row yourself and add root tag manually on merge). – zero323 Apr 26 '18 at 12:11
  • @user6910411 But I think the library should have supported this. Are there any methods that could utilize all the worker nodes? – Punith Raj Apr 26 '18 at 12:22
  • The solution I described would utilize workers. And "should" is disputable. If your data is too large to be processed by a single node, writing a single file is almost always a bad idea, and kind of leaks bad data management practices into your code. The only thing worse is "it must be gzipped" on top of that. Just saying... One way or another it doesn't (AFAIK). Of course you can open a PR and make your case. – zero323 Apr 26 '18 at 12:26
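
The "add root tag manually on merge" idea from the comments could be sketched roughly as below. This is only an illustration, not something the library provides: the class name RootTagMerge and the method merge are hypothetical, it reads local files through java.nio rather than HDFS, and it assumes the opening and closing root tags each sit alone on their own line, as in the part file shown in the question. Each part is streamed line by line, so the data is never held in memory as a whole.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of spark-xml or Hadoop.
public class RootTagMerge {

    /**
     * Concatenates the part files into one document, skipping each part's
     * own root wrapper and writing a single wrapper around all rows.
     * Assumption: the root open/close tags appear alone on their lines.
     */
    public static void merge(List<Path> parts, Path target, String rootTag) throws IOException {
        String open = "<" + rootTag + ">";
        String close = "</" + rootTag + ">";
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            // Write the root opening tag exactly once.
            out.write(open);
            out.write('\n');
            for (Path part : parts) {
                try (BufferedReader in = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String trimmed = line.trim();
                        // Drop the per-part root wrapper lines, keep everything else.
                        if (trimmed.equals(open) || trimmed.equals(close)) {
                            continue;
                        }
                        out.write(line);
                        out.write('\n');
                    }
                }
            }
            // Write the root closing tag exactly once.
            out.write(close);
            out.write('\n');
        }
    }
}
```

Called as RootTagMerge.merge(parts, target, "n:Brands") on the two part files above, this should yield one n:Brands element containing both n:Brand rows. The Spark job still writes its parts in parallel; only the final concatenation is sequential, which is also the case with copyMerge. For HDFS, the same loop could presumably be run over streams opened from the Hadoop FileSystem instead of java.nio paths.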
