
I have a static HTML page saved on my local machine. I tried a simple file open, and BeautifulSoup. With file open it doesn't read the entire HTML file due to a Unicode error, and BeautifulSoup works only for live websites.

#with beautifulSoup
from bs4 import BeautifulSoup
import urllib.request
url="Stack Overflow.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.find_all('a',class_='institution')
for university in universities:
    print(university['href']+","+university.string)


#Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

After reading the HTML, I wish to extract data from ul and li elements which don't have any attributes. Any recommendations are welcome.

Julien Marrec
user73324
  • Which error(s) are you getting? What exactly do you want to extract from the page? Post the HTML contents (relevant part(s)) and your desired output. – alecxe Jan 03 '17 at 04:28
  • issue resolved, thanks a bunch for all help !! – user73324 Jan 06 '17 at 13:56

2 Answers


I'm not sure exactly what you mean; I just understand that you want to read an entire HTML file from local storage and parse some of the DOM with bs4, right?

I suggest this code:

from bs4 import BeautifulSoup

with open("Stack Overflow.html", encoding="utf-8") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'html.parser')
    # universities = soup.find_all('a', class_='institution')
    # for university in universities:
    #     print(university['href'] + "," + university.string)
    ul_list = soup.select("ul")
    for ul in ul_list:
        if not ul.attrs:
            for li in ul.select("li"):
                if not li.attrs:
                    print(li.get_text().strip())
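Reading the whole file into a string first is optional; BeautifulSoup also accepts the open file handle itself. A minimal sketch (the tiny sample file written here just stands in for your real HTML file):

```python
from bs4 import BeautifulSoup

# A tiny sample file stands in for "Stack Overflow.html" in this sketch.
with open("sample.html", "w", encoding="utf-8") as f:
    f.write("<ul><li>alpha</li><li>beta</li></ul>")

# Pass the open file handle directly instead of f.read().
with open("sample.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print([li.get_text() for li in soup.select("li")])  # ['alpha', 'beta']
```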
yumere
  • Yes, I have an HTML file on my machine, which I wish to parse and extract values between the list tag which doesn't have any attribute with BS4 – user73324 Jan 03 '17 at 05:55
  • How do I extract the value from a tag, for example: Contents of the text file: , which doesn't have values – user73324 Jan 03 '17 at 05:59
  • @user73324 I modified some code. Please see code again. But I'm not sure what you exactly want. If that code is wrong, please let me see some local html example and output what you expect. – yumere Jan 03 '17 at 06:34
  • I tried running it but doesn't return any values:
    • dependency-check version: 1.4.3
    • Report Generated On: Dec 30, 2016 at 13:33:27 UTC
    • Dependencies Scanned: 0 (0 unique)
    • Vulnerable Dependencies: 0
    • Vulnerabilities Found: 0
    • Vulnerabilities Suppressed: 0
    • ...
    – user73324 Jan 03 '17 at 14:08
  • with open('dependency-check-report.html', encoding='utf-8') as f:
        data = f.read()
        soup = BeautifulSoup(data, 'html.parser')
        li_list = soup.select("li")
        for li in li_list:
            if not li.attrs:
                for i in li.select("li"):
                    if not i.attrs:
                        print(i.get_text().strip())
    – user73324 Jan 03 '17 at 14:09
  • @user73324 Your above code is wrong. You repeated `select("li")` two times. First `for`, you should get `ul_list` from the `soup` and then you can select `li`. Finally, you can get text value from `li` by using `get_text()` – yumere Jan 03 '17 at 14:17
  • with open('C:/dependency-check-report.html') as f:
        data = f.read()
        soup = BeautifulSoup(data, 'html.parser')
        li_list = soup.select('li')
        for li in li_list:
            dat = li.get_text()
            if dat.find(pairKeynameNumVulnDependencies) != -1:
                numVulnDependencies = re.findall(r'([0-9])', dat)
                if int(numVulnDependencies[0]) > 0:
                    print("No")
                else:
                    print("Go")
    – user73324 Jan 06 '17 at 13:55
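For reference, a cleaned-up, runnable sketch of the approach the asker describes in the comments above. The inline HTML fragment and the value assigned to `pairKeynameNumVulnDependencies` are assumptions, since neither is defined in the comment itself:

```python
import re
from bs4 import BeautifulSoup

# Assumption: the label being searched for is "Vulnerable Dependencies".
pairKeynameNumVulnDependencies = "Vulnerable Dependencies"

# Assumption: a fragment shaped like a dependency-check report line.
html = "<ul><li><i>Vulnerable Dependencies</i>:&nbsp;3</li></ul>"

soup = BeautifulSoup(html, "html.parser")
for li in soup.select("li"):
    dat = li.get_text()
    if dat.find(pairKeynameNumVulnDependencies) != -1:
        # Pull the number out of the matching line.
        numVulnDependencies = re.findall(r"([0-9]+)", dat)
        # "No" when vulnerable dependencies were found, "Go" otherwise.
        print("No" if int(numVulnDependencies[0]) > 0 else "Go")
```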

This question is about how to construct a BeautifulSoup Object.

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")

Just pass a file object to BeautifulSoup; you do not need to add encoding information explicitly, BS will handle it.

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
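For instance, an HTML entity in the input comes back as the corresponding Unicode character after parsing (a minimal sketch):

```python
from bs4 import BeautifulSoup

# The &eacute; entity is decoded to the Unicode character é during parsing.
soup = BeautifulSoup("<p>Sacr&eacute; bleu!</p>", "html.parser")
print(soup.p.string)  # Sacré bleu!
```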

If you have trouble extracting data, you should post the HTML code.

Extract:

import bs4

html = '''<ul class="indent">
<li><i>dependency-check version</i>: 1.4.3</li>
<li><i>Report Generated On</i>: Dec 30, 2016 at 13:33:27 UTC</li>
<li><i>Dependencies Scanned</i>:&nbsp;0 (0 unique)</li>
<li><i>Vulnerable Dependencies</i>:&nbsp;0</li>
<li><i>Vulnerabilities Found</i>:&nbsp;0</li>
<li><i>Vulnerabilities Suppressed</i>:&nbsp;0</li>
<li class="scaninfo">...</li>
</ul>'''

soup = bs4.BeautifulSoup(html, 'lxml')
for i in soup.find_all('li', class_=False):
    print(i.text)

out:

dependency-check version: 1.4.3
Report Generated On: Dec 30, 2016 at 13:33:27 UTC
Dependencies Scanned: 0 (0 unique)
Vulnerable Dependencies: 0
Vulnerabilities Found: 0
Vulnerabilities Suppressed: 0
宏杰李
  • dependency-check version: 1.4.3
  • Report Generated On: Dec 30, 2016 at 13:33:27 UTC
  • Dependencies Scanned: 0 (0 unique)
  • Vulnerable Dependencies: 0
  • Vulnerabilities Found: 0
  • Vulnerabilities Suppressed: 0
  • ...
  • – user73324 Jan 03 '17 at 14:13