What I am trying to do: I have been asked to flatten an XML file using Spark with Java, but without the com.databricks spark-xml utility.
I have copied the XMLInputFormat Java code (referenced as XMLInputFormat_New below) and am using it so that, when the file is processed via an RDD, an input split falling in the middle of a record does not cause a problem.
import java.io.Serializable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class XMLParser implements Serializable {

    // Returns one (byteOffset, xmlFragment) pair per <book>...</book> record,
    // so split boundaries are handled by the custom input format.
    protected JavaPairRDD<LongWritable, Text> getInputRDD(JavaSparkContext sparkContext, String fileName) {
        sparkContext.hadoopConfiguration().set(XMLInputFormat_New.START_TAG_KEY, "<book>");
        sparkContext.hadoopConfiguration().set(XMLInputFormat_New.END_TAG_KEY, "</book>");
        sparkContext.hadoopConfiguration().set(FileInputFormat.INPUT_DIR, fileName);
        return sparkContext.newAPIHadoopRDD(sparkContext.hadoopConfiguration(),
                XMLInputFormat_New.class, LongWritable.class, Text.class);
    }

    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("File_Validation").setMaster("local");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        XMLParser xmlParser = new XMLParser();
        JavaPairRDD<LongWritable, Text> lines = xmlParser.getInputRDD(sparkContext, "hdfs://user/books.xml");
    }
}
Now that I have the data in the lines RDD, how do I proceed further to get the record set? Kindly help.
I know I have to use a SAX parser to flatten each record, that the parsing has to be written inside an RDD transformation function, and that an action has to be called afterwards. But I cannot find a way to proceed further.
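For what it is worth, here is one way I imagine the per-record parsing could look, as a minimal sketch. It assumes the input format hands over each complete <book>...</book> fragment as a single Text value, and the class and method names (BookParser, parseBook) are hypothetical, not from any library; the handler collects each leaf element's text into a flat field-name-to-value map using the standard javax.xml.parsers SAX API:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: flattens one <book>...</book> fragment into a
// field-name -> value map using a SAX handler.
public class BookParser {

    public static Map<String, String> parseBook(String xml) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    private final StringBuilder text = new StringBuilder();
                    private String current;

                    @Override
                    public void startElement(String uri, String local, String qName, Attributes attrs) {
                        current = qName;     // remember the most recently opened element
                        text.setLength(0);   // reset the text buffer for it
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        // Only leaf elements close while still "current";
                        // the enclosing <book> element is skipped.
                        if (qName.equals(current) && !"book".equals(qName)) {
                            fields.put(qName, text.toString().trim());
                        }
                    }
                });
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<book><title>Spark</title><author>Doe</author></book>";
        System.out.println(parseBook(fragment)); // {title=Spark, author=Doe}
    }
}
```

If something like this is reasonable, I assume it would be wired into the RDD as a transformation, e.g. lines.map(pair -> BookParser.parseBook(pair._2.toString())), followed by an action such as collect() or saveAsTextFile() — but I am not sure whether this is the idiomatic approach.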