What I am trying to do: I have been asked to flatten an XML file using Spark with Java, but without the com.databricks spark-xml utility.
I have copied the XMLInputFormat Java code (referenced as XMLInputFormat_New below) and am using it so that, when the file is processed via an RDD, an input split falling in the middle of a record does not cause a problem.
import java.io.Serializable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class XMLParser implements Serializable {

    // Returns one (byteOffset, xmlFragment) pair per <book>...</book> record,
    // so split boundaries are handled by the custom input format.
    protected JavaPairRDD<LongWritable, Text> getInputRDD(JavaSparkContext sparkContext, String fileName) {
        sparkContext.hadoopConfiguration().set(XMLInputFormat_New.START_TAG_KEY, "<book>");
        sparkContext.hadoopConfiguration().set(XMLInputFormat_New.END_TAG_KEY, "</book>");
        sparkContext.hadoopConfiguration().set(FileInputFormat.INPUT_DIR, fileName);
        return sparkContext.newAPIHadoopRDD(sparkContext.hadoopConfiguration(),
                XMLInputFormat_New.class, LongWritable.class, Text.class);
    }

    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("File_Validation").setMaster("local");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        XMLParser xmlParser = new XMLParser();
        JavaPairRDD<LongWritable, Text> lines = xmlParser.getInputRDD(sparkContext, "hdfs://user/books.xml");
    }
}
Now that I have the data in the lines RDD, how do I proceed further to get the record set? Kindly help.
I know I have to use a SAX parser to flatten each record, that the parsing has to be written inside an RDD transformation function, and that an action has to be called afterwards. But I cannot find a way to proceed further.
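For what it is worth, here is one way I imagine the per-record parsing could look, as a minimal sketch. It assumes the input format hands over each complete <book>...</book> fragment as a single Text value, and the class and method names (BookParser, parseBook) are hypothetical, not from any library; the handler collects each leaf element's text into a flat field-name-to-value map using the standard javax.xml.parsers SAX API:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: flattens one <book>...</book> fragment into a
// field-name -> value map using a SAX handler.
public class BookParser {

    public static Map<String, String> parseBook(String xml) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    private final StringBuilder text = new StringBuilder();
                    private String current;

                    @Override
                    public void startElement(String uri, String local, String qName, Attributes attrs) {
                        current = qName;     // remember the most recently opened element
                        text.setLength(0);   // reset the text buffer for it
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        // Only leaf elements close while still "current";
                        // the enclosing <book> element is skipped.
                        if (qName.equals(current) && !"book".equals(qName)) {
                            fields.put(qName, text.toString().trim());
                        }
                    }
                });
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<book><title>Spark</title><author>Doe</author></book>";
        System.out.println(parseBook(fragment)); // {title=Spark, author=Doe}
    }
}
```

If something like this is reasonable, I assume it would be wired into the RDD as a transformation, e.g. lines.map(pair -> BookParser.parseBook(pair._2.toString())), followed by an action such as collect() or saveAsTextFile() — but I am not sure whether this is the idiomatic approach.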