I am trying to extract text from the HTML below using Python and Beautiful Soup, but I am having trouble. Each <p> starts with a label inside a <b> tag, and I want only the text outside of that tag. For instance, the first row is labelled "Assessment Type", but I only want "Capacity curve". Here is what I have so far:

modelinginfo = soup.find("div", {"id": "genInfo"})  # this is my raw data
rows = modelinginfo.findChildren(['p'])  # this is the data displayed below
for row in rows:
    print(row)
    print('\n')
    cells = row.findChildren('p')
    for cell in cells:
        value = cell.string
        print("The value in this cell is %s" % value)


[<p><b>Assessment Type: </b>Capacity curve</p>,
 <p><b>Name: </b>Borzi et al (2008) - Capacity-Xdir 4Storeys InfilledFrame NonSismicallyDesigned</p>,
 <p><b>Category: </b>Structure specific - Building</p>,
 <p><b>Taxonomy: </b>CR/LFINF+DNO/HEX:4 (GEM)</p>,
 <p><b>Reference: </b>The influence of infill panels on vulnerability curves for RC buildings (Borzi B., Crowley H., Pinho R., 2008) - Proceedings of the 14th World Conference on Earthquake Engineering, Beijing, China</p>,
 <p><b>Web Link: </b><a href="http://www.iitk.ac.in/nicee/wcee/article/14_09-01-0111.PDF" style="color:blue" target="_blank"> http://www.iitk.ac.in/nicee/wcee/article/14_09-01-0111.PDF</a></p>,
 <p><b>Methodology: </b>Analytical</p>,
 <p><b>General Comments: </b>Sample Data: A 4-storey building designed according to the 1992 Italian design code (DM, 1992), considering gravity loads only, and the Decreto Ministeriale 1996 (DM, 1996) when considering seismic action (the seismically designed building has been designed assuming a lateral force equal to 10% of the seismic weight, c=10%, and with a triangular distribution shape).

 The Y axis in the capacity curve represent the collapse multiplier: Base shear resistance over seismic weight.</p>,
 <p><b>Geographical Applicability: </b> Italy</p>]
MattDMo
Corncobpipe
  • Can you add a link to the site? – Padraic Cunningham May 12 '16 at 18:56
  • maybe you can fetch the whole string by using conditional statements and then split and remove the unnecessary stuff... – dot.Py May 12 '16 at 19:14
  • The site is password protected so a link wouldn't help. Dot_Py could you explain how to do that? I am newish to python so it is a little difficult to understand how? – Corncobpipe May 12 '16 at 19:17
  • Possible duplicate of [Only extracting text from this element, not its children](https://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children) – Alex78191 Oct 25 '17 at 14:51

1 Answer

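1.) You can iterate over the children of each p and print everything except the b tag. Loop over rows from your code here: the p tags shown contain no nested p, so row.findChildren('p') returns an empty list and the inner loop never runs.

for row in rows:
    for element in row.children:
        # skip the <b> label; everything else is the value
        if element.name != 'b':
            print("The value in this cell is %s" % element)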
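As a self-contained illustration of the first approach, here is a minimal sketch. It assumes the page looks like the output in the question; the `html` string is a two-row sample copied from it, and `html.parser` is used only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup (assumption): two rows copied from the question's output.
html = """
<div id="genInfo">
  <p><b>Assessment Type: </b>Capacity curve</p>
  <p><b>Geographical Applicability: </b> Italy</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
modelinginfo = soup.find("div", {"id": "genInfo"})

values = []
for row in modelinginfo.find_all("p"):
    # Keep every direct child except the <b> label. Plain strings have
    # name == None, so they pass the filter and are used as-is.
    parts = [
        child.get_text() if child.name else str(child)
        for child in row.children
        if child.name != "b"
    ]
    values.append("".join(parts).strip())

print(values)
```

Calling `get_text()` on the remaining tags means a row such as "Web Link", whose value sits inside an `<a>`, would yield the link text rather than the markup.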

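2.) You can use the extract() method to remove the unneeded b tag, then read the text that is left:

for row in rows:
    if row.b:
        # remove the "b" tag from the tree
        row.b.extract()
    # get_text() returns the remaining text without the surrounding <p> markup
    print("The value in this cell is %s" % row.get_text(strip=True))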
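If you also want to keep the labels, the second approach can be extended into a label-to-value mapping. This is only a sketch under the same assumptions as the question's output: the `html` sample and the `info` dictionary are illustrative names, not part of the original page or code.

```python
from bs4 import BeautifulSoup

# Sample markup (assumption): two rows copied from the question's output.
html = """
<div id="genInfo">
  <p><b>Assessment Type: </b>Capacity curve</p>
  <p><b>Web Link: </b><a href="http://www.iitk.ac.in/nicee/wcee/article/14_09-01-0111.PDF"> http://www.iitk.ac.in/nicee/wcee/article/14_09-01-0111.PDF</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

info = {}
for row in soup.find("div", {"id": "genInfo"}).find_all("p"):
    if row.b is None:
        continue  # no label in this row, nothing to split
    # extract() removes the <b> tag from the tree and returns it
    label_tag = row.b.extract()
    label = label_tag.get_text(strip=True).rstrip(":")
    # what remains in the row is the value
    info[label] = row.get_text(strip=True)

print(info)
```

Note that extract() modifies the parsed tree in place, so after this loop the `<b>` labels are no longer present in `soup`.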
Eugen