
I am trying to learn how to scrape a website using XML. I am quite familiar with HTML, but I noticed that some of the websites I attempt to scrape have XML APIs, which, if I am not mistaken, are faster and simpler to scrape.

I have the following sample XML:

txt = '''<?xml version="1.0"?>
<DIV5 N="100" TYPE="HEADER">
   <HEAD>100: DIV5 Title</HEAD>
   <AUTH>
      <HED>Authority:</HED>
      <PSPACE>AUTHORITY SPACE</PSPACE>
   </AUTH>
   <SOURCE>
      <HED>Source:</HED>
      <PSPACE>Source Text.</PSPACE>
   </SOURCE>
   <DIV7 N="1" TYPE="SUB">
      <HEAD>1. DIV7Title 1</HEAD>
      <DIV8 N="1.1" TYPE="SECTION">
         <HEAD>1.1 DIV8Title 1</HEAD>
         <P> (1) Text 1</P>
      </DIV8>

      <DIV8 N="1.2" TYPE="SECTION">
         <HEAD>1.2 DIV8Title 2</HEAD>
         <P>(a) text 1 </P>
         <P>(ii) text 2 </P>
         <P>(2) text 2.1 </P>
      </DIV8>

      <DIV8 N="1.3" TYPE="SECTION">
         <HEAD>1.3 DIV8 Title 3</HEAD>
         <P> (ff) text 1 </P>
      </DIV8>
   </DIV7>
   <DIV6 N="A" TYPE="SUBPART">
      <HEAD>Subpart A: DIV6Title 1 </HEAD>
      <DIV7 N="2" TYPE="SUB">
         <HEAD>2 DIV7Title 2</HEAD>
         <DIV8 N="2.1" TYPE="SECTION">
            <HEAD>2.1 DIV8Title 1 </HEAD>
            <P>(a) text 1</P>
            <P>(b) text 2 </P>
            <P>(c) text 3</P>
         </DIV8>

         <DIV8 N="2.2" TYPE="SECTION">
            <HEAD>2.2 DIV8 Title2</HEAD>
            <P> (o) text</P>
         </DIV8>
      </DIV7>
      <DIV7 N="3" TYPE="SUB">
         <HEAD>3. DIV7 Title 3</HEAD>
         <DIV8 N="3.1" TYPE="SECTION">
            <HEAD>3.1 DIV8 Title 1</HEAD>
            <P>(r) text 1</P>
            <P>(s) text 2</P>
         </DIV8>
      </DIV7>
   </DIV6>
   <DIV6 N="B" TYPE="SUBPART">
      <HEAD>Subpart B: DIV6 Title 2</HEAD>
         <DIV8 N="12" TYPE="SECTION">
            <HEAD>12. DIV8 Title 1</HEAD>
            <P>7(a) text </P>
         </DIV8>
   </DIV6>
</DIV5>
'''

I have the following code:

import lxml.etree as ElementTree

tree = ElementTree.ElementTree(ElementTree.fromstring(txt))
root = tree.getroot()

sub_parts = root.findall(".//DIV6")

for sub in sub_parts:
   l2_title = sub.find('.//HEAD').text
   # Unsure what to do after this part

Problem I am Having:

  1. The children of DIV5 are [HEAD, AUTH, SOURCE, DIV7, DIV6, DIV6]. The code above only grabs the DIV6 elements and their children; it completely skips the DIV7 sibling and its children. How can I parse both at the same time?

Ideal Outcome:

L2Ci  L2Title      L3Ci  L3Title      L4Ci  L4Title      L5Ci  L5Title  L6Ci  L6Title
                   1.    DIV7Title 1  1.1   DIV8Title 1  (1)   Text 1
                   1.    DIV7Title 1  1.2   DIV8Title 2  (a)   text 1
                   1.    DIV7Title 1  1.2   DIV8Title 2  (ii)  text 2
                   1.    DIV7Title 1  1.2   DIV8Title 2  (ii)  text 2   (2)   text 2.1
                   1.    DIV7Title 1  1.3   DIV8Title 3  (ff)  text 1
A     DIV6Title 1  2.    DIV7Title 2  2.1   DIV8Title 1  (a)   text 1
A     DIV6Title 1  2.    DIV7Title 2  2.1   DIV8Title 1  (b)   text 2
A     DIV6Title 1  2.    DIV7Title 2  2.1   DIV8Title 1  (c)   text 3
A     DIV6Title 1  2.    DIV7Title 2  2.2   DIV8Title 2  (o)   text 1
A     DIV6Title 1  3.    DIV7Title 3  3.1   DIV8Title 3  (r)   text 1
A     DIV6Title 1  3.    DIV7Title 3  3.1   DIV8Title 3  (s)   text 2
B     DIV6Title 2                     12    DIV8Title 1  (a)   text

Thank You!


1 Answer


Instead of trying to find tags by name, you can process all of the tags "recursively", e.g. with .iter().

You can keep track of the information you need to build a "row".

In this case, each time a <P> tag is encountered, the "row" is considered complete, and you can store the result.
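To get a feel for the traversal, a quick check like this (it just reuses the tree built in the question, and the tag filter is only there to keep the output short) prints the DIV elements in the order .iter() visits them:

for item in tree.iter():
   if item.tag in ('DIV5', 'DIV6', 'DIV7', 'DIV8'):
      # Document order: DIV5, then DIV7 1, DIV8 1.1, ..., DIV6 A, DIV7 2, ...
      print(item.tag, item.attrib.get('N'), item.attrib.get('TYPE'))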

Something like:

columns = [
   'L2Ci', 'L2Title', 'L3Ci', 'L3Title',
   'L4Ci', 'L4Title', 'L5Ci', 'L5Title',
   'L6Ci', 'L6Title'
]

row = dict.fromkeys(columns)
rows = []

tag_type = None

for item in tree.iter():
   if item.attrib.get('TYPE', '') == 'SUBPART':
      # New sub-part, empty out all previously seen values
      for key in row:
         row[key] = None
      row['L2Ci'] = item.attrib['N']
      tag_type = 'L2Title'

   if item.attrib.get('TYPE', '') == 'SUB':
      row['L3Ci'] = item.attrib['N']
      tag_type = 'L3Title'
  
   if item.attrib.get('TYPE', '') == 'SECTION':
      row['L4Ci'] = item.attrib['N']
      tag_type = 'L4Title'
  
   if item.tag == 'HEAD':
      if tag_type is not None:
         row[tag_type] = item.text.strip()
 
   if item.tag == 'P':
      row['P'] = item.text.strip()
      rows.append(row.copy())

You can then create a dataframe: df = pd.DataFrame(rows)
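For completeness, this assumes pandas is installed and imported; a minimal version of that last step would be:

import pandas as pd

df = pd.DataFrame(rows)
# Columns that were never filled in (L5Ci/L5Title/L6Ci/L6Title here) simply stay as None
print(df.to_string(index=False))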

L2Ci  L2Title                  L3Ci  L3Title          L4Ci  L4Title           L5Ci  L5Title  L6Ci  L6Title  P
                               1     1. DIV7Title 1   1.1   1.1 DIV8Title 1                                 (1) Text 1
                               1     1. DIV7Title 1   1.2   1.2 DIV8Title 2                                 (a) text 1
                               1     1. DIV7Title 1   1.2   1.2 DIV8Title 2                                 (ii) text 2
                               1     1. DIV7Title 1   1.2   1.2 DIV8Title 2                                 (2) text 2.1
                               1     1. DIV7Title 1   1.3   1.3 DIV8 Title 3                                (ff) text 1
A     Subpart A: DIV6Title 1   2     2 DIV7Title 2    2.1   2.1 DIV8Title 1                                 (a) text 1
A     Subpart A: DIV6Title 1   2     2 DIV7Title 2    2.1   2.1 DIV8Title 1                                 (b) text 2
A     Subpart A: DIV6Title 1   2     2 DIV7Title 2    2.1   2.1 DIV8Title 1                                 (c) text 3
A     Subpart A: DIV6Title 1   2     2 DIV7Title 2    2.2   2.2 DIV8 Title2                                 (o) text
A     Subpart A: DIV6Title 1   3     3. DIV7 Title 3  3.1   3.1 DIV8 Title 1                                (r) text 1
A     Subpart A: DIV6Title 1   3     3. DIV7 Title 3  3.1   3.1 DIV8 Title 1                                (s) text 2
B     Subpart B: DIV6 Title 2                         12    12. DIV8 Title 1                                7(a) text

You can then implement the rest of the logic to populate the L5/L6 columns from the P text.
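How a marker ends up in the L5 versus L6 columns is not fully specified in the question, but as a rough sketch, a hypothetical helper like split_marker below could peel the leading "(x)" marker off the P text before it is assigned inside the if item.tag == 'P': branch:

import re

def split_marker(p_text):
   # Hypothetical helper: "(a) text 1" -> ("(a)", "text 1"); no leading marker -> (None, stripped text)
   match = re.match(r'\s*(\([^)]+\))\s*(.*)', p_text)
   if match:
      return match.group(1), match.group(2).strip()
   return None, p_text.strip()

# For example, inside the P branch:
#    row['L5Ci'], row['L5Title'] = split_marker(item.text)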
