I have the following XHTML file, which contains about 30-40 images. The file is auto-generated, so the image numbers will change between runs, but the {html text} content, which is what the images should really be named after, does not change. I was hoping someone could point me in the right direction.

I'm trying to parse/find these images to rename them from image#.png to {html text}.png.

Substring of the HTML:

<div class="s8a6d62e8" style="">Top 10 ARP sources in terms of bits.</div>
<div class="sbeea9846" style="">
    <img style="width: 701px; height: 526px; border: 0px" src="Final Test Report_3.files\Final Test Report_34.Png"></img>
</div>
<div class="s306f0049" style="">Figure 3 - Top Ten ARP MAC Sources</div>
<div class="s12d95b95" style="">
    <a name="Top Ten ARP MAC Destinations"><br></a>
</div>
<div class="s1a75bf07" style="">Top Ten ARP MAC Destinations</div>
<div class="s8a6d62e8" style="">Top 10 ARP destinations in terms of bits.</div>
<div class="sbeea9846" style="">
    <img style="width: 701px; height: 526px; border: 0px" src="Final Test Report_3.files\Final Test Report_35.Png"></img>
</div>
<div class="s306f0049" style="">Figure 4 - Top Ten ARP MAC Destinations</div>
<div class="s1a75bf07" style="">ARP MAC Conversations</div>
<div class="s8a6d62e8" style="">Conversation ring with ARP endpoints and conversations.</div>
<div class="sbeea9846" style="">
    <img style="width: 701px; height: 526px; border: 0px" src="Final Test Report_3.files\Final Test Report_36.Png"></img>
</div>
<div class="s306f0049" style="">Figure 5 - ARP MAC Conversations</div>

The output I would like is as follows:

Final Test Report_3.files\Top Ten ARP MAC Sources.Png
Final Test Report_3.files\Top Ten ARP MAC Destinations.Png
Final Test Report_3.files\ARP MAC Conversations.Png

etc.,
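
The mapping from the sample HTML to the desired output can be sketched as a small naming rule. This is only a minimal Python sketch, not something tested against the real report: the function name `target_name` is hypothetical, and it assumes every caption has the form "Figure N - Title" and that the title contains no characters that are illegal in file names.

```python
import re
from pathlib import PureWindowsPath


def target_name(src: str, caption: str) -> str:
    """Build the new image path from the img src and its figure caption."""
    # Assumption: captions look like "Figure 3 - Some Title".
    title = re.sub(r"^\s*Figure\s+\d+\s*-\s*", "", caption).strip()
    old = PureWindowsPath(src)  # the src values use backslash separators
    # Keep the directory and the original extension; replace only the stem.
    return str(old.with_name(title + old.suffix))


print(target_name(r"Final Test Report_3.files\Final Test Report_34.Png",
                  "Figure 3 - Top Ten ARP MAC Sources"))
# prints: Final Test Report_3.files\Top Ten ARP MAC Sources.Png
```

`PureWindowsPath` is used so the backslash in the `src` value is treated as a separator on any platform; the actual rename on disk is left out here.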

awm
  • Note that the `<img>` tag comes before the `<div>` tag where the title is. The title appears below the image. – awm Sep 15 '12 at 17:21
  • http://tika.apache.org/1.2/parser.html seems to generate SAX events, so you need to store the src attribute in a state variable when the IMG tag is encountered, couple it with the #CDATA of the next DIV tag and push it to a list, and clear the state variable. – Vikdor Sep 15 '12 at 17:53
  • My only issue with that would be: if it's SAX, why not just use SAX? Why use Tika at all? – awm Sep 16 '12 at 16:28
  • Tika's HTMLParser seems to be more tolerant than a traditional SAX parser when it comes to parsing HTML snippets. A SAX parser throws an exception if the HTML snippet is not well-formed XML, whereas Tika can happily parse the snippet you posted in the question. – Vikdor Sep 16 '12 at 16:31
  • Seems like Tika was just not designed for what I'm asking of it. It isn't designed to parse the whole document: "div" elements are skipped and the "content" of tags is ignored. I would have to rewrite most of the handlers to get it to do what I need. – awm Sep 16 '12 at 21:25
  • Right, I too realized that `div`s are being skipped. Not sure why. Otherwise, my solution to parse the IMG and DIV would have worked perfectly. – Vikdor Sep 17 '12 at 02:19
  • Because "The `<div>` elements contain no inherent semantic meaning". Not sure why that is the case, but that's the reason that was posted. – awm Sep 17 '12 at 23:20
  • Is the way the HTML is generated something you can modify? – Vikdor Sep 18 '12 at 02:14
  • No, it's generated externally. – awm Sep 19 '12 at 12:00
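
The state-variable approach Vikdor describes in the comments does not depend on Tika. Below is a minimal sketch of it in Python's standard-library `html.parser` rather than the Java tools discussed in the thread; that parser is event-based like SAX but tolerates markup that is not well-formed XML, such as the unclosed `<br>` in the snippet. The class name `FigurePairer` is hypothetical, and the sketch assumes `div` elements are never nested, as in the posted HTML.

```python
from html.parser import HTMLParser


class FigurePairer(HTMLParser):
    """Pair each <img> src with the text of the next non-empty <div>."""

    def __init__(self):
        super().__init__()
        self.pending_src = None  # state variable: src still waiting for a caption
        self.in_div = False
        self.text = []           # text fragments of the div being read
        self.pairs = []          # collected (src, caption) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.pending_src = dict(attrs).get("src")
        elif tag == "div":
            self.in_div = True
            self.text = []

    def handle_data(self, data):
        if self.in_div:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "div":
            return
        caption = "".join(self.text).strip()
        self.in_div = False
        self.text = []
        # The div wrapping the image has no text, so it is skipped here.
        if self.pending_src and caption:
            self.pairs.append((self.pending_src, caption))
            self.pending_src = None  # clear the state variable


sample = r'''
<div class="sbeea9846" style="">
    <img style="width: 701px; height: 526px; border: 0px" src="Final Test Report_3.files\Final Test Report_34.Png"></img>
</div>
<div class="s306f0049" style="">Figure 3 - Top Ten ARP MAC Sources</div>
'''

parser = FigurePairer()
parser.feed(sample)
parser.close()
print(parser.pairs)
```

Each pair holds the original `src` and the caption that follows it, which is enough to drive a rename step afterwards.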
