I have the following code, which parses an XML file to produce a pandas DataFrame. The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
<EntrySynopsisDetail_1_0>
<EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>

And my code is below:

from bs4 import BeautifulSoup
import pandas as pd 

fd = open("file_120123.xml",'r')
data = fd.read()

Bs_data = BeautifulSoup(data,'xml')

ID = Bs_data.find_all('EntryID')
Title = Bs_data.find_all('EntryTitle')
try: 
   Cat = Bs_data.find_all('CategoryOfEntry')
except IndexError:
   Cat = ''

CatDict = {
    "ENG":"English",
    "MAT" :"Mathematics"
}

dataDf = []
for i in range(0,len(ID)):
      if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
       
      rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
      dataDf.append(rows)
    
df = pd.DataFrame(dataDf, columns =['ID', 'Title', 'Category'], dtype=float)
df.to_csv('120123.csv')

As you can see, the code reads an XML file called 'file_120123.xml' with the BeautifulSoup library and collects each of the elements present in the file. One of the elements (CategoryOfEntry) holds a key, and I have created a dictionary listing all possible keys. Not every parent element contains that element. I want to compare each extracted key with the keys in the dictionary and replace it with the corresponding value.

With this code, I get the error IndexError: list index out of range, raised by Cat[i] on the line if (Cat[i] == CatDict):. Any insights on how to resolve this?

tachyon

2 Answers

If you just want to avoid raising the error, add a conditional break:

for i in range(len(ID)):
    if i >= len(Cat):
        break  # <-- stop once Cat has no element at index i

    code = Cat[i].get_text()
    category = CatDict.get(code, code)  # fall back to the raw code if it is not in CatDict

    rows = [ID[i].get_text(), Title[i].get_text(), category]
    dataDf.append(rows)
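
Note that the break only silences the exception. Cat is a flat list collected across the whole file, so when an entry has several CategoryOfEntry tags or none at all, Cat[i] no longer belongs to entry i. A minimal sketch of pairing each entry with its own categories follows. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree to stay dependency-free, and the names CAT_NAMES and entry_rows are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Sample from the question; the second entry carries two categories.
XML = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>"""

CAT_NAMES = {"ENG": "English", "MAT": "Mathematics"}


def entry_rows(xml_text):
    """Return one [id, title, categories] row per entry element."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("EntrySynopsisDetail_1_0"):
        # findall is scoped to this entry: a missing tag yields [] instead of an IndexError.
        codes = [(c.text or "").strip() for c in entry.findall("CategoryOfEntry")]
        # Unknown codes are kept as-is rather than dropped.
        names = [CAT_NAMES.get(code, code) for code in codes if code]
        rows.append([entry.findtext("EntryID"), entry.findtext("EntryTitle"), "; ".join(names)])
    return rows


rows = entry_rows(XML)
```

Joining multiple categories with "; " is only one possible choice; one row per category would work equally well, depending on what the CSV is used for.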
Driftr95

First, as to why lxml is better than BeautifulSoup for XML, the answer is simple: the best way to query XML is with XPath. lxml supports XPath (though only version 1.0; for more complex XML and queries you will need XPath 2.0 to 3.1 and a library like elementpath). BS doesn't support XPath, though it does have excellent support for CSS selectors, which work better with HTML.
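
To give a feel for what a path query with a predicate looks like, here is a small sketch. It deliberately uses the standard library's xml.etree.ElementTree, which implements only a limited XPath subset, so that it runs without extra installs. With lxml, a similar expression could be passed to an element's .xpath(...) method instead. The sample is the one from the question:

```python
import xml.etree.ElementTree as ET

XML = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>"""

root = ET.fromstring(XML)

# Predicate [CategoryOfEntry='MAT']: entries with at least one such child whose text is "MAT".
mat_entries = root.findall(".//EntrySynopsisDetail_1_0[CategoryOfEntry='MAT']")
mat_ids = [entry.findtext("EntryID") for entry in mat_entries]

# The same query for "ENG" matches both entries.
eng_ids = [
    entry.findtext("EntryID")
    for entry in root.findall(".//EntrySynopsisDetail_1_0[CategoryOfEntry='ENG']")
]
```

The query selects whole entries, so repeated or missing CategoryOfEntry tags do not shift anything out of alignment.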

Having said all that - in your particular case, you probably don't need lxml either - only pandas and a one-liner! Though you haven't shown your expected output, my guess is that you expect the output below. Note that your sample XML probably contains an error: the 2nd <EntrySynopsisDetail_1_0> has <CategoryOfEntry> twice, so I removed one:

entries = """<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
  
<EntrySynopsisDetail_1_0>
        <EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>"""

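from io import StringIO
import pandas as pd

# Recent pandas versions deprecate passing literal XML; wrap the string in StringIO.
pd.read_xml(StringIO(entries), xpath="//EntrySynopsisDetail_1_0")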

Output:

   EntryID                                        EntryTitle CategoryOfEntry
0   262148  Establishment of the Graduate Internship Program             ENG
1  2667654                         Call for Mobility Program             MAT
Jack Fleeting
  • Thanks for the explanation. The XML file actually has multiple occurrences of CategoryOfEntry in many elements, and in some cases, it does not have this tag which is why there is an index error arising from my piece of code. – tachyon Jan 13 '23 at 11:44
  • Also, how do you replace ENG and MAT with English and Mathematics? I want the keys to be replaced with the provided values. – tachyon Jan 13 '23 at 11:45
  • @tachyon I see; in that case, depending on the structure, it may become difficult or impossible to convert the XML into a dataframe; you may have to query the XML directly with XPath (assuming it is otherwise well-formed; if not, it's not really XML), or contact whoever creates the XML and ask them to fix the structure. – Jack Fleeting Jan 13 '23 at 11:47
  • that is a great insight - thanks. – tachyon Jan 13 '23 at 11:51
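
For the two follow-up points in the comments (entries with several or no CategoryOfEntry tags, and replacing ENG/MAT with the full names), one possible pandas-side approach is sketched below. It assumes the entries have already been parsed into one record per entry with a list of category codes. The third record is hypothetical and only stands in for an entry without a category, and CAT_NAMES mirrors the dictionary from the question:

```python
import pandas as pd

CAT_NAMES = {"ENG": "English", "MAT": "Mathematics"}

# One record per entry; "Category" holds a list because an entry may have 0..n codes.
records = [
    {"ID": "262148", "Title": "Establishment of the Graduate Internship Program", "Category": ["ENG"]},
    {"ID": "2667654", "Title": "Call for Mobility Program", "Category": ["ENG", "MAT"]},
    # Hypothetical record, not in the sample XML: an entry with no CategoryOfEntry.
    {"ID": "0", "Title": "(hypothetical entry without a category)", "Category": []},
]

df = pd.DataFrame(records)

# explode gives one row per (entry, category) pair; an empty list becomes a single NaN row.
long_df = df.explode("Category", ignore_index=True)

# map looks each code up in the dict; NaN and codes missing from the dict come out as NaN.
long_df["Category"] = long_df["Category"].map(CAT_NAMES)
```

If unknown codes should be kept rather than turned into NaN, a lookup such as CAT_NAMES.get(code, code) applied per value would be an alternative to map.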