How to scrape different content with the same HTML attributes and values?

Question

I'm able to scrape a bunch of data from a webpage, but I'm struggling to extract specific content from subsections whose elements have exactly the same attributes and values. Here is the HTML:

   <li class="highlight">
     Relationship Issues
   </li>
   <li class="highlight">
     Depression
   </li>
   <li class="highlight">
     Spirituality
   </li>

   <li class="">
     ADHD
   </li>
   <li class="">
     Alcohol Use
   </li>
   <li class="">
     Anger Management
   </li>

Using that HTML as a reference, I have the following:

import requests
from bs4 import BeautifulSoup
import html5lib
import re

headers = {'User-Agent': 'Mozilla/5.0'}
URL = "website.com"


page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, 'html5lib')

specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})

for x in specialties:
   Specialty_1 = x.find('li', {'class': 'highlight'}).text
   Specialty_2 = x.find('li', {'class': 'highlight'}).text
   Specialty_3 = x.find('li', {'class': 'highlight'}).text

So the ideal outcome is to have: Specialty_1 = Relationship Issues; Specialty_2 = Depression; Specialty_3 = Spirituality

AND

Issue_1 = ADHD; Issue_2 = Alcohol Use; Issue_3 = Anger Management

Would appreciate any and all help!

I think we need to see more of the html. At the moment, you are simply selecting the first li, if present, 3 times. You really want a loop over a list of the li elements. Can you share the url? As the loop is currently set up, you would also overwrite the variables within the loop. — QHarr, Oct 23 '20 at 06:36
Here's the url: https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578 — Tom, Oct 23 '20 at 21:29
The issue I'm running into is that there are several li later on with data that I want without a value. How do I address that? — Tom, Oct 23 '20 at 21:30
You asked for those 3 values. I have put that in the bottom half of my answer. What else did you need from that page, please? — QHarr, Oct 23 '20 at 23:03
@QHarr I added some more HTML with the li attribute but no value, how do you identify those? Your code works great for li.highlight — Tom, Oct 24 '20 at 04:06
How do you want those returned? Does Andrej's answer do what you need? — QHarr, Oct 24 '20 at 04:14
Both your answer and Andrej's work for the earlier code, but once I introduce the HTML with just li and no highlight, I have issues — Tom, Oct 24 '20 at 04:21
I just added the HTML and the ideal outcome from that HTML. The issue I'm having is that the li sections without a class are difficult to extract data from. — Tom, Oct 24 '20 at 04:31
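To illustrate QHarr's first comment, here is a minimal sketch run against a trimmed copy of the HTML from the question. The wrapping div and ul are assumptions (only the class string `spec-list attributes-top` comes from the question's code), and `html.parser` is used instead of `html5lib` so nothing beyond bs4 is needed. It shows that repeated `find` calls keep returning the first match, while `find_all` returns every match.

```python
from bs4 import BeautifulSoup

# Trimmed sample; the <div> and <ul> wrappers are assumed for illustration.
html_doc = """
<div class="spec-list attributes-top">
  <ul>
    <li class="highlight"> Relationship Issues </li>
    <li class="highlight"> Depression </li>
    <li class="highlight"> Spirituality </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
container = soup.find("div", {"class": "spec-list attributes-top"})

# find() always returns the first matching element, so three calls
# give the same text three times.
repeated = [
    container.find("li", {"class": "highlight"}).text.strip()
    for _ in range(3)
]

# find_all() returns every matching element, in document order.
all_items = [
    li.get_text(strip=True)
    for li in container.find_all("li", {"class": "highlight"})
]

print(repeated)
print(all_items)
```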

score 0 · Answer 1 · answered Oct 23 '20 at 05:23

You can use XPath if you know the data will always sit at the same place in the element tree. In most cases you can right-click an element in Chrome DevTools and copy either a CSS selector or an XPath string for it.
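BeautifulSoup itself does not evaluate XPath, so the sketch below assumes the third-party `lxml` package is installed and uses it in place of bs4. The sample HTML is a trimmed copy of the question's snippet with an assumed `<ul>` wrapper, and the XPath expressions are written by hand for this sample, not copied from DevTools for the real page.

```python
from lxml import html

# Trimmed sample; the <ul> wrapper is assumed for illustration.
html_doc = """
<ul>
  <li class="highlight"> Relationship Issues </li>
  <li class="highlight"> Depression </li>
  <li class="highlight"> Spirituality </li>
  <li class=""> ADHD </li>
  <li class=""> Alcohol Use </li>
  <li class=""> Anger Management </li>
</ul>
"""

tree = html.fromstring(html_doc)

# li elements whose class attribute is exactly "highlight".
specialties = [
    li.text_content().strip()
    for li in tree.xpath('.//li[@class="highlight"]')
]

# Every other li, whether class is empty or missing altogether.
issues = [
    li.text_content().strip()
    for li in tree.xpath('.//li[not(@class="highlight")]')
]

print(specialties)
print(issues)
```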

Dan Weber

QHarr · Accepted Answer · 2020-10-24T05:12:16.023

You could develop Andrej's dictionary idea: use an if/else on whether the li has a class value to decide the key prefix, and extend the select to include the additional section. You also need to reset the numbering when the new section starts, e.g. with a flag:

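# Assumes `soup` is the BeautifulSoup object built in the question.
results = {}
flag = False  # set once the first item of the issues section is seen
counter = 1

for j in soup.select(".specialties-list li, .attributes-issues li"):
    # .get() returns None instead of raising KeyError when an li has
    # no class attribute at all.
    if j.get('class'):
        results[f'Specialty_{counter}'] = j.text.strip()
    else:
        if not flag:
            # Restart the numbering for the issues section.
            counter = 1
            flag = True
        results[f'Issue_{counter}'] = j.text.strip()
    counter += 1

print(results)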
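As a possible variation on the same idea, each section can be selected separately and numbered with `enumerate`, which removes the need for a flag and a shared counter. This is only a sketch: the selectors `.specialties-list` and `.attributes-issues` are taken from the code above, but the sample HTML wrapping them is assumed, and `number_items` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; only the two container class names come from
# the selectors used in the answer.
html_doc = """
<div class="specialties-list">
  <ul>
    <li class="highlight"> Relationship Issues </li>
    <li class="highlight"> Depression </li>
    <li class="highlight"> Spirituality </li>
  </ul>
</div>
<div class="attributes-issues">
  <ul>
    <li class=""> ADHD </li>
    <li class=""> Alcohol Use </li>
    <li class=""> Anger Management </li>
  </ul>
</div>
"""


def number_items(prefix, tags):
    """Hypothetical helper: map 'prefix_1', 'prefix_2', ... to each tag's text."""
    return {
        f"{prefix}_{i}": tag.get_text(strip=True)
        for i, tag in enumerate(tags, 1)
    }


soup = BeautifulSoup(html_doc, "html.parser")

# Each section gets its own numbering because enumerate restarts per call.
results = {
    **number_items("Specialty", soup.select(".specialties-list li")),
    **number_items("Issue", soup.select(".attributes-issues li")),
}

print(results)
```

Selecting by container rather than by the li's own class also keeps working if the site later drops the `highlight` class from the specialties.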

score 0 · Answer 3 · answered Oct 23 '20 at 07:27

If you want a variable number of variables, use a dictionary. For example:

from bs4 import BeautifulSoup


html_doc = '''   <li class="highlight">
     Relationship Issues
      </li>
   <li class="highlight">
     Depression
      </li>
   <li class="highlight">
     Spirituality
      </li>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

out = {
    'Specialty_{}'.format(i): specialty.get_text(strip=True)
    for i, specialty in enumerate(soup.select("li.highlight"), 1)
}

print(out)

Prints:

{'Specialty_1': 'Relationship Issues', 
 'Specialty_2': 'Depression', 
 'Specialty_3': 'Spirituality'}
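For the li elements added to the question later, the ones with an empty class, one option is to select everything that is not `.highlight`. The sketch below runs on the question's HTML alone, so `li:not(.highlight)` is safe there; on the full page that selector would likely need to be scoped to a container. It also stores each group as a list under one key instead of numbered keys, which is a design choice of this sketch and not part of the answer above.

```python
from bs4 import BeautifulSoup

html_doc = """
<li class="highlight"> Relationship Issues </li>
<li class="highlight"> Depression </li>
<li class="highlight"> Spirituality </li>
<li class=""> ADHD </li>
<li class=""> Alcohol Use </li>
<li class=""> Anger Management </li>
"""

soup = BeautifulSoup(html_doc, "html.parser")

out = {
    # li elements that carry the "highlight" class.
    "Specialty": [
        li.get_text(strip=True) for li in soup.select("li.highlight")
    ],
    # Every other li, including those with class="".
    "Issue": [
        li.get_text(strip=True) for li in soup.select("li:not(.highlight)")
    ],
}

print(out)
```

With lists, the n-th issue is simply `out["Issue"][n - 1]`, and the number of items per group does not have to be known in advance.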
