How to avoid running into IndexError: list index out of range error if an element is nonexistent while parsing xml with BeautifulSoup in python

Question

I have the following code that parses an XML file to produce a pandas DataFrame. The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
<EntrySynopsisDetail_1_0>
<EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>

And my code is below:

from bs4 import BeautifulSoup
import pandas as pd 

fd = open("file_120123.xml",'r')
data = fd.read()

Bs_data = BeautifulSoup(data,'xml')

ID = Bs_data.find_all('EntryID')
Title = Bs_data.find_all('EntryTitle')
try: 
   Cat = Bs_data.find_all('CategoryOfEntry')
except IndexError:
   Cat = ''

CatDict = {
    "ENG":"English",
    "MAT" :"Mathematics"
}

dataDf = []
for i in range(0,len(ID)):
      if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
       
      rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
      dataDf.append(rows)
    
df = pd.DataFrame(dataDf, columns =['ID', 'Title', 'Category'], dtype=float)
df.to_csv('120123.csv')

As you can see, the code reads an XML file called 'file_120123.xml' using the BeautifulSoup library and collects each of the elements present in the file. One of the elements (CategoryOfEntry) holds a key, and I have created a dictionary listing all possible keys. Not all parent elements contain that element. I want to look up each extracted key in the dictionary and replace it with the corresponding value.

With this code, I get IndexError: list index out of range at Cat[i], on the line if (Cat[i] == CatDict):. Any insights on how to resolve this?

It's not a good idea to parse xml w/ BS; try either lxml or pandas.read_xml(). Also, please edit your question and add a short, representative sample of your xml as well as your expected output. — Jack Fleeting, Jan 12 '23 at 15:02
Please edit your question and put there a sample of the XML and what information you're trying to get. — Andrej Kesely, Jan 12 '23 at 16:31
@JackFleeting I have added the XML details. Could you explain why BS is not a good idea to parse XML? I saw some examples showing that BS can also handle poorly structured XML. — tachyon, Jan 13 '23 at 07:37

score 0 · Answer 1 · answered Jan 12 '23 at 20:44

If you just want to avoid raising the error, add a conditional break:

for i in range(len(ID)):
    if i >= len(Cat):  # <-- break the loop once the length of Cat is exceeded
        break

    code = Cat[i].get_text()  # look up the tag's text, not the Tag object itself
    rows = [ID[i].get_text(), Title[i].get_text(), CatDict.get(code, code)]
    dataDf.append(rows)
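
Note that the break only prevents the exception. Because find_all collects tags across the whole document, Cat[i] is not guaranteed to belong to the same entry as ID[i] once some entry has zero or several CategoryOfEntry tags. Below is a minimal sketch of one way around that, assuming every entry sits in an EntrySynopsisDetail_1_0 element as in the question's sample. The third entry in xml_data is hypothetical, added only to show the missing-category case, and joining several categories with ", " is an arbitrary choice.

```python
from bs4 import BeautifulSoup

# Sample based on the question; the third entry is hypothetical and has no category.
xml_data = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>300001</EntryID>
    <EntryTitle>Untitled Notice</EntryTitle>
  </EntrySynopsisDetail_1_0>
</Entries>"""

CatDict = {"ENG": "English", "MAT": "Mathematics"}

soup = BeautifulSoup(xml_data, "xml")

rows = []
for entry in soup.find_all("EntrySynopsisDetail_1_0"):
    # Searching inside one entry keeps ID, title and categories aligned.
    codes = [c.get_text(strip=True) for c in entry.find_all("CategoryOfEntry")]
    # Unknown codes are kept unchanged; an entry without categories yields "".
    names = ", ".join(CatDict.get(code, code) for code in codes)
    rows.append([
        entry.find("EntryID").get_text(strip=True),
        entry.find("EntryTitle").get_text(strip=True),
        names,
    ])

print(rows)
```

The resulting rows list can be passed to pd.DataFrame(rows, columns=['ID', 'Title', 'Category']) in place of dataDf.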

score 0 · Answer 2 · answered Jan 13 '23 at 11:39

First, as to why lxml is better than BeautifulSoup for XML, the answer is simple: the best way to query XML is with XPath. lxml supports XPath (though only version 1.0; for more complex XML and queries you will need XPath 2.0 to 3.1 and a library like elementpath). BS doesn't support XPath, though it does have excellent support for CSS selectors, which work better with HTML.

Having said all that, in your particular case you probably don't need lxml either: only pandas and a one-liner! Though you haven't shown your expected output, my guess is that you expect the output below. Note that there is probably an error in your sample XML: the 2nd <EntrySynopsisDetail_1_0> has <CategoryOfEntry> twice, so I removed one:

entries = """<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
  
<EntrySynopsisDetail_1_0>
        <EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>"""

from io import StringIO  # recent pandas versions deprecate passing a literal XML string

pd.read_xml(StringIO(entries), xpath="//EntrySynopsisDetail_1_0")

Output:

   EntryID                                        EntryTitle CategoryOfEntry
0   262148  Establishment of the Graduate Internship Program             ENG
1  2667654                         Call for Mobility Program             MAT
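
If the category codes should also be replaced by their full names, one possible follow-up is Series.map with a dictionary. This is only a sketch: the small frame below is typed in by hand to mirror the output above, cat_names copies the question's CatDict, and the new column name Category is an arbitrary choice.

```python
import pandas as pd

# Small frame mirroring the relevant columns of the output above.
df = pd.DataFrame({
    "EntryID": [262148, 2667654],
    "CategoryOfEntry": ["ENG", "MAT"],
})

# Mapping taken from the question's CatDict.
cat_names = {"ENG": "English", "MAT": "Mathematics"}

# map() gives NaN for codes missing from the dict; fillna() restores the original code there.
df["Category"] = df["CategoryOfEntry"].map(cat_names).fillna(df["CategoryOfEntry"])

print(df)
```

Falling back to the original code for unmapped values keeps unexpected categories visible instead of silently turning them into NaN.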

Thanks for the explanation. The XML file actually has multiple occurrences of CategoryOfEntry in many elements, and in some cases, it does not have this tag which is why there is an index error arising from my piece of code. — tachyon, Jan 13 '23 at 11:44
Also, how do you replace ENG and MAT with English and Mathematics? I want the keys to be replaced with the provided values. — tachyon, Jan 13 '23 at 11:45
@tachyon I see; in that case, it may, depending on the structure, become difficult or impossible to convert the xml into a dataframe; you may have to just query the xml directly with xpath (assuming it is otherwise well formed - if not, it's not really xml), or contact whoever creates the xml and ask them to fix the structure. — Jack Fleeting, Jan 13 '23 at 11:47
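
For the case described in these comments (several CategoryOfEntry tags in some entries, none in others), querying the XML directly per entry could look roughly like the sketch below. It swaps in the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, rather than lxml. The third entry in xml_data is hypothetical, and joining multiple categories with "; " is just one possible convention.

```python
import xml.etree.ElementTree as ET

# Sample based on the question; the third entry is hypothetical and has no category.
xml_data = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>300001</EntryID>
    <EntryTitle>Untitled Notice</EntryTitle>
  </EntrySynopsisDetail_1_0>
</Entries>"""

CAT_NAMES = {"ENG": "English", "MAT": "Mathematics"}

root = ET.fromstring(xml_data)

rows = []
# ".//EntrySynopsisDetail_1_0" is within the XPath subset that ElementTree supports.
for entry in root.findall(".//EntrySynopsisDetail_1_0"):
    # findall() returns an empty list when the tag is absent, so nothing is indexed blindly.
    codes = [c.text.strip() for c in entry.findall("CategoryOfEntry") if c.text]
    rows.append({
        "ID": entry.findtext("EntryID", default="").strip(),
        "Title": entry.findtext("EntryTitle", default="").strip(),
        "Category": "; ".join(CAT_NAMES.get(code, code) for code in codes),
    })

for row in rows:
    print(row)
```

A list of dictionaries like rows can be handed to pd.DataFrame(rows) to get one row per entry, however many categories each entry has.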
