I followed this documentation as closely as I could:

https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html

Here is my code:

public static void main(String[] args)
{
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setTempLocation("gs://balajee_test/stagging");
    options.setProject("test-1-130106");

    Pipeline p = Pipeline.create(options);

    PCollection<XMLFormatter> record = p.apply(XmlIO.<XMLFormatter>read()
            .from("gs://balajee_test/sample_3.xml")
            .withRootElement("book")
            .withRecordElement("author")
            .withRecordElement("title")
            .withRecordElement("genre")
            .withRecordElement("price")
            .withRecordElement("description")
            .withRecordClass(XMLFormatter.class)
            );

    record.apply(ParDo.of(new DoFn<XMLFormatter, String>() {
        @ProcessElement
        public void processElement(ProcessContext c)
        {
            System.out.println(c.element().getAuthor());
        }
    }));

    p.run();
}

I'm getting a 'null' value for every XML element. Could you please review my code and suggest what I need to correct?

Test File

package com.bitwise.cloud;

import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;

@XmlRootElement(name = "book")
@XmlType(propOrder = {"author", "title", "genre", "price", "description"})
public class XMLFormatter {
    private String author;
    private String title;
    private String genre;
    private String price;
    private String description;

    public XMLFormatter() { }

    public XMLFormatter(String author, String title, String genre, String price, String description) {
        this.author = author;
        this.title = title;
        this.genre = genre;
        this.price = price;
        this.description = description;
    }

    @XmlElement
    public void setAuthor(String author) {
        this.author = author;
    }

    public String getAuthor() {
        return author;
    }

    @XmlElement
    public void setTitle(String title) {
        this.title = title;
    }

    public String getTitle() {
        return title;
    }

    @XmlElement
    public void setGenre(String genre) {
        this.genre = genre;
    }

    public String getGenre() {
        return genre;
    }

    @XmlElement
    public void setPrice(String price) {
        this.price = price;
    }

    public String getPrice() {
        return price;
    }

    @XmlElement
    public void setDescription(String description) {
        this.description = description;
    }

    public String getDescription() {
        return description;
    }
}
Balajee Venkatesh
  • What runner are you using? The DirectRunner? Dataflow Runner? Something else? Do you have a job ID for the failing pipeline? – Ben Chambers Aug 29 '17 at 23:20
  • Tried using DirectRunner as well as DataflowRunner. I do have a job ID for the failing Pipeline (2017-08-30_02_55_24-4448720439481076797). Would you mind sharing a working code sample for reading an XML file? – Balajee Venkatesh Aug 30 '17 at 09:57
  • @BenChambers Could you please review my code? I still haven't been able to resolve this issue. It would be quite helpful if you could share a working snippet. – Balajee Venkatesh Aug 31 '17 at 13:35
  • Can you provide your test file? – Lara Schmidt Aug 31 '17 at 23:27
  • Have you tried running a DirectPipeline from a txt file instead of a GCS file? That would tell you whether the issue is with reading from GCS or with the XML formatter. Or you could try reading from a GCS file and writing the output line by line to verify that the GCS file can be read. – Lara Schmidt Aug 31 '17 at 23:35
  • @LaraSchmidt Added a snapshot of my test file as well as the code of the 'XMLFormatter' class. Please check the code and let me know what I'm doing wrong. – Balajee Venkatesh Sep 01 '17 at 13:38
  • Ok, thanks. Everything looks okay. Does the pipeline run if you read from a text file using direct runner? – Lara Schmidt Sep 01 '17 at 20:39
  • Hi Lara! The pipeline runs and reads the file successfully if it is provided as a text file. It seems that 'XmlIO' is unable to map the XML elements to the corresponding setters and getters of the 'XMLFormatter' class. Getting 'null' values in the output is really surprising to me. Any thoughts? – Balajee Venkatesh Sep 01 '17 at 20:47
  • Any update? @LaraSchmidt – Balajee Venkatesh Sep 04 '17 at 05:29
  • Great support :D would be happy to see a working snippet, too! Since Apache is handling the library, documentation and solution-searching became quite annoying :-( – Malte Sep 18 '17 at 14:06
  • @BalajeeVenkatesh this issue has been triaged by Lara internally and is now being investigated by additional engineers. There is no ETA for the fix but they should reply here with additional information when it becomes available. Since this has been escalated internally you may wish to move this to the [Public Issue Tracker](https://cloud.google.com/support/docs/issue-trackers) as that is the correct place to report Google-end issues. – Jordan Sep 18 '17 at 15:36
  • I don't believe XMLIO supports record elements with different element names (author, title, genre, etc). You have to provide a single root element and a record element and your XML document has to contain records that have the same record element. See the example given in the doc string: https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlIO.java#L59 – chamikara Sep 25 '17 at 03:25
  • @BalajeeVenkatesh Any updates? Were you able to solve it? Has this issue been resolved yet? – vikeng21 Nov 01 '19 at 17:56
  • @chamikara Any updates? Were you able to solve it? Has this issue been resolved yet? – vikeng21 Nov 01 '19 at 17:57
  • @LaraSchmidt Has this issue been resolved, Lara? It would be very helpful if you could provide some updates. – vikeng21 Nov 01 '19 at 17:58

1 Answer

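The XmlIO.Read PTransform doesn't support multiple record elements (author, title, genre, and so on). You have to provide a single root element and a single record element, and every record in your XML document has to use that same record element. For the class in the question, which is annotated with @XmlRootElement(name = "book"), that most likely means passing "book" as the record element and the element that encloses the books as the root element. See the example in the doc string at the following location.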

https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlIO.java#L59
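
To make the "one root element, one repeated record element" shape concrete, here is a minimal sketch that uses only the JDK's StAX parser. It is not how XmlIO is implemented, and the document shape (a `catalog` root wrapping `book` records), the sample data, and the class and method names are all assumptions made for illustration, since the actual test file is not shown in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustration only: a plain StAX reader, not Beam's XmlIO.
final class RecordElementSketch {

    // Assumed document shape: a single root element ("catalog") that wraps
    // repeated records which all share one element name ("book").
    static final String SAMPLE =
        "<catalog>"
      + "<book><author>A. Writer</author><title>First</title></book>"
      + "<book><author>B. Writer</author><title>Second</title></book>"
      + "</catalog>";

    // Returns the <author> text of every record found inside the root element.
    static List<String> authors(String xml, String rootElement, String recordElement) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            boolean inRoot = false;
            boolean inRecord = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals(rootElement)) {
                        inRoot = true;
                    } else if (inRoot && name.equals(recordElement)) {
                        inRecord = true;   // a new record starts here
                    } else if (inRecord && name.equals("author")) {
                        result.add(reader.getElementText());   // a field of the record
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals(recordElement)) {
                        inRecord = false;
                    } else if (name.equals(rootElement)) {
                        inRoot = false;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String author : authors(SAMPLE, "catalog", "book")) {
            System.out.println(author);
        }
    }
}
```

Under the same assumed document shape, the read in the question would use a single withRootElement("catalog") and a single withRecordElement("book"), with author, title, and the other fields mapped by the JAXB annotations on the record class rather than by extra withRecordElement calls.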

chamikara