
I have an HTML file like this (it contains more than 100 records):

<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">John Smith</h3>
        <span class="light-text">Center - VAR - Employee I</span>
</div>

<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">Jenna Smith</h3>
        <span class="light-text">West - VAR - Employee I</span>
</div>

<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">Jordan Smith</h3>
        <span class="light-text">East - VAR - Employee II</span>
</div>

I need to extract the names only if the person is an Employee I, which is what makes this challenging. How can I select the tags whose next tag contains Employee I? Or should I use a different method? Is it even possible to use a condition in this case?

import re

# Read the whole file, then print the first occurrence of "Employee I"
with open("file.html", 'r') as f:
    html = f.read()
    print(re.search(r'\bEmployee I\b', html).group(0))

In other words, how can I tell it to go back and read the previous tag?

HedgeHog
user15109593
  • You are looking for an XPath tutorial. – Mad Physicist Sep 20 '22 at 13:54
  • If you are going to do this more often: have you tried using [BeautifulSoup](https://pypi.org/project/beautifulsoup4/)? – 9769953 Sep 20 '22 at 13:54
  • I'm not sure BS4 would make any difference here, since my issue is with the IF condition. – user15109593 Sep 20 '22 at 13:55
  • What is your issue? What have you tried? Can you share the code? – HuLu ViCa Sep 20 '22 at 14:00
  • I can't think of a way to search for a word and, if the word matches, read the previous line. I just shared my code, which is only two lines that find the word. – user15109593 Sep 20 '22 at 14:02
  • Do try using BeautifulSoup. It will make this almost trivial. By contrast, a regex search of the raw string content is completely unsuitable here. – Konrad Rudolph Sep 20 '22 at 14:04
  • BeautifulSoup allows plenty of filtering: on tags, attributes, CSS, classes and contents. It also allows you to move sideways (siblings) and up and down the HTML tree (parents and children). – 9769953 Sep 20 '22 at 14:10

2 Answers

import re
from bs4 import BeautifulSoup

with open('inputfile.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

names = [span.parent.find('h3').string 
         for span in 
         soup.find_all('span', 
                       class_='light-text', 
                       string=re.compile('Employee I$'))
        ]
print(names)

gives

['John Smith', 'Jenna Smith']

I've formatted the list comprehension over several lines for clarity, so that it is easier to see where to adjust things for other use cases. Of course, a normal for-loop that appends to a list also works fine; I just like list comprehensions.
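For readers who prefer the explicit loop, here is a minimal sketch of the same lookup written as a for-loop. It assumes the three sample records from the question, inlined as a string (the name SAMPLE_HTML is only for illustration), and adds a guard for records without an h3 that the original comprehension does not have.

```python
import re
from bs4 import BeautifulSoup

# Sample records copied from the question; inlined so the sketch is self-contained.
SAMPLE_HTML = """
<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">John Smith</h3>
    <span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">Jenna Smith</h3>
    <span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">Jordan Smith</h3>
    <span class="light-text">East - VAR - Employee II</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

names = []
# Find every span whose text ends in "Employee I", then step up to the
# enclosing div and look for the h3 that holds the name.
for span in soup.find_all("span", string=re.compile(r"Employee I$")):
    heading = span.parent.find("h3")
    if heading is not None:  # skip records that have no name heading
        names.append(heading.get_text(strip=True))

print(names)
```

The loop form makes it easy to add further conditions or logging per record, at the cost of a few more lines.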

The re.compile('Employee I$') is necessary to avoid matching on 'Employee II'. The class_ argument is an extra, and may not be needed.
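The effect of the $ anchor can be checked on plain strings, without any HTML parsing. The following is a small sketch using only the standard re module; the labels list is just the three span texts from the question.

```python
import re

# The three span texts from the question's sample HTML.
labels = [
    "Center - VAR - Employee I",
    "West - VAR - Employee I",
    "East - VAR - Employee II",
]

unanchored = re.compile(r"Employee I")   # also matches inside "Employee II"
anchored = re.compile(r"Employee I$")    # must be at the end of the string

unanchored_hits = [s for s in labels if unanchored.search(s)]
anchored_hits = [s for s in labels if anchored.search(s)]

print(len(unanchored_hits))  # all three labels match
print(anchored_hits)         # only the two "Employee I" labels match
```

Note that the anchored pattern would stop matching if the span text had trailing spaces, so stripping the text first may be needed for less tidy input.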

The rest is nearly self-explanatory, especially with the BeautifulSoup documentation at hand.

Note that the string argument of find_all() used to be called text, in case you're using an older version of BeautifulSoup (before 4.4.0).

9769953
from bs4 import BeautifulSoup

test = '''<div class="cell-62 pl-1 pt-0_5">
        <h3 class="very-big-text light-text">John Smith</h3>
                <span class="light-text">Center - VAR - Employee I</span>
        </div>

        <div class="cell-62 pl-1 pt-0_5">
            <h3 class="very-big-text light-text">Jenna Smith</h3>
                <span class="light-text">West - VAR - Employee I</span>
        </div>

        <div class="cell-62 pl-1 pt-0_5">
            <h3 class="very-big-text light-text">Jordan Smith</h3>
                <span class="light-text">East - VAR - Employee II</span>
        </div>'''

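# Name the parser explicitly; otherwise BeautifulSoup guesses one and warns
soup = BeautifulSoup(test, 'html.parser')
for person in soup.find_all('div'):
    name = person.find('h3').text
    # The employee level is the third hyphen-separated field of the span text
    employee_level = person.find('span').text.split('-')[2].strip()
    if employee_level == "Employee I":
        print(name)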
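The split('-')[2] lookup above assumes exactly two hyphens before the employee level. A slightly more tolerant sketch, assuming only that the level is the last hyphen-separated field, could use rsplit instead. The function name names_with_level and the SAMPLE_HTML constant are hypothetical, introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Sample records copied from the question; inlined so the sketch is self-contained.
SAMPLE_HTML = """
<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">John Smith</h3>
    <span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">Jenna Smith</h3>
    <span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
    <h3 class="very-big-text light-text">Jordan Smith</h3>
    <span class="light-text">East - VAR - Employee II</span>
</div>
"""


def names_with_level(html, level):
    """Return the names whose last hyphen-separated span field equals level."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for div in soup.find_all("div"):
        heading = div.find("h3")
        label = div.find("span")
        if heading is None or label is None:
            continue  # skip divs that are not complete records
        # Take everything after the last hyphen, so extra hyphens earlier
        # in the label do not shift the field we compare against.
        last_field = label.get_text().rsplit("-", 1)[-1].strip()
        if last_field == level:
            matches.append(heading.get_text(strip=True))
    return matches


print(names_with_level(SAMPLE_HTML, "Employee I"))
```

Because the comparison is an exact match on the last field, "Employee II" is excluded without needing a regular expression.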
inarighas