
I am trying to extract XML fragments from an HTML source. The source looks like this:

.
.
.
<h5>
 <u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
 linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
 linktext2
</a>
</d>
</li>
</ul>
<h5>
 <u>B</u>
</h5>
<ul class="listss">
 .\
 .(SAME TAGS AS ABOVE)
 ./
</ul>
<h5>
 <u>C</u>
</h5>
<ul class="listss">
 .\
 .(SAME TAGS AS ABOVE)
 ./
</ul>
<h5>
 <u>D</u>
</h5>
<ul class="listss">
 .\
 .(SAME TAGS AS ABOVE)
 ./
</ul>

Actually, I need the parent-child relation, so I need to extract each group of nodes with XPath first. But I couldn't manage to get the range of markup from "h5" to "/ul". I need the "h5" and "ul" tags together. The output must look like this:

<h5>
    <u>A</u>
</h5>
<ul class="listss">
 <li>
  <d>
   <a href="link">
    linktext
   </a>
  </d>
 </li>
 <li>
  <d>
   <a href="link2">
    linktext2
   </a>
  </d>
 </li>
</ul>

I searched tons of links and tried everything, but none of these XPath expressions worked:

/.../*[self::dns:h5 or self::dns:ul]
/.../*[self::dns:h5|self::dns:ul]
/.../*[self::h5 or self::ul]

Any ideas? Thanks.

t.ztrk
  • Please add the desired output to your question. – zx485 Dec 22 '19 at 06:22
  • Your html has two h5/ul couples; what's the difference (to you) between the first and second? – Jack Fleeting Dec 22 '19 at 19:49
  • The first h5 tag holds the year (in this example A, B, C, D), and under each year there is a list of links. I just want to group and extract each year and its links together, like A and its links, B and its links, etc. If this is confusing, I can change the second h5 tag, which is under the ul tag. – t.ztrk Dec 22 '19 at 23:19

1 Answer


If you use Python, you can do this:

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''<h5>
  <u>A</u>
</h5>
<ul class="listss">
  <li>
    <d>
      <a href="link">
        linktext
      </a>
    </d>
  </li>
  <li>
    <d>
      <a href="link2">
        linktext2
      </a>
    </d>
  </li>
</ul>
<h5>
  <u>B</u>
</h5>
<ul class="listss">
  .\
  .(SAME TAGS AS ABOVE)
  ./
</ul>
<h5>
  <u>C</u>
</h5>
<ul class="listss">
  .\
  .(SAME TAGS AS ABOVE)
  ./
</ul>
<h5>
  <u>D</u>
</h5>
<ul class="listss">
  .\
  .(SAME TAGS AS ABOVE)
  ./
</ul>'''
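doc = SimplifiedDoc(html)
# Walk the top-level elements in document order
items = doc.children
lastName = None
for item in items:
  if item.tag == 'h5':
    # Remember the heading text; it labels the <ul> that follows
    lastName = item.text
  else:
    # Collect every <a> inside the <ul> that follows the heading
    links = item.getElementsByTag('a')
    print(lastName, links)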

Result:

A [{'href': 'link', 'tag': 'a', 'html': 'linktext\n      '}, {'href': 'link2', 'tag': 'a', 'html': 'linktext2\n      '}]
B []
C []
D []
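
If you also need the "h5" and "ul" markup itself as one block (the output the question asks for), here is a minimal sketch of the same pairing idea using only the standard library. It assumes the fragment is well-formed XML, which real-world HTML often is not, and it wraps the fragment in a made-up `<root>` element so it can be parsed. The names `group_h5_with_ul` and `serialize_group` and the `link3` item under B are hypothetical, added only for illustration. In an XPath 1.0 engine, the same pairing can be expressed by evaluating `following-sibling::ul[1]` relative to each "h5".

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, trimmed to two groups.
# The single item under B is made up for illustration.
SNIPPET = """
<h5><u>A</u></h5>
<ul class="listss">
  <li><d><a href="link">linktext</a></d></li>
  <li><d><a href="link2">linktext2</a></d></li>
</ul>
<h5><u>B</u></h5>
<ul class="listss">
  <li><d><a href="link3">linktext3</a></d></li>
</ul>
"""


def group_h5_with_ul(markup):
    """Return a list of (heading_text, h5_element, ul_element) tuples."""
    # Wrap in one root element: the fragment has several top-level elements.
    root = ET.fromstring("<root>" + markup + "</root>")
    groups = []
    current_h5 = None
    for child in root:
        if child.tag == "h5":
            # Remember the heading until its list shows up.
            current_h5 = child
        elif child.tag == "ul" and current_h5 is not None:
            heading = "".join(current_h5.itertext()).strip()
            groups.append((heading, current_h5, child))
            current_h5 = None
    return groups


def serialize_group(h5, ul):
    """Serialize an <h5> and its <ul> back into one markup string."""
    return (ET.tostring(h5, encoding="unicode").strip() + "\n"
            + ET.tostring(ul, encoding="unicode").strip())


if __name__ == "__main__":
    for heading, h5, ul in group_h5_with_ul(SNIPPET):
        links = [(a.get("href"), (a.text or "").strip()) for a in ul.iter("a")]
        print(heading, links)
        print(serialize_group(h5, ul))
```

Each tuple keeps the original elements, so you can still run further queries on the "ul" of a given heading instead of working with plain text.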
dabingsou