
I'm trying to parse a webpage and create a site map using Python. I've written the below piece of code:

import urllib.request
from bs4 import BeautifulSoup

mypage = "http://example.com/"
page = urllib.request.urlopen(mypage)

soup = BeautifulSoup(page, 'html.parser')

all_links = soup.find_all('a')

for link in all_links:
    print(link.get('href'))

The above code prints all the links in example.com (external and internal).

  • I need to filter out the external links and print only the internal links. I know I can differentiate them using the domain name ("example.com" vs. "somethingelse.com" or whatever the name is), but I'm unable to work out the regex format for this - or is there a built-in library that helps in achieving this?
  • Once I get all the internal links, how do I map them? For instance, "example.com" has a link to "example.com/page1", which has a link to "example.com/page3". What is the ideal way to create a map for this kind of flow? I'm looking for a library or logic that shows "example.com" -> "example.com/page1" -> "example.com/page3" or something similar.
Firstname

2 Answers


I have written a piece of code for generating a sitemap.xml file in the Python Flask framework:

import xml.etree.ElementTree as ET
import datetime

def registerSiteMaps():
    root = ET.Element('urlset')
    root.attrib['xmlns:xsi'] = "http://www.w3.org/2001/XMLSchema-instance"
    root.attrib['xsi:schemaLocation'] = "http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
    root.attrib['xmlns'] = "http://www.sitemaps.org/schemas/sitemap/0.9"

    q = db.result  # application-specific query; q.results yields documents

    for doc in q.results:
        uid = doc['uid']
        # uid encodes the path: '__' separates segments, '_' becomes '-'
        site_root = uid.replace('__', '/').replace('_', '-')
        dt = datetime.datetime.now().strftime("%Y-%m-%d")
        url_el = ET.SubElement(root, "url")
        ET.SubElement(url_el, "loc").text = "https://www.example.com/" + site_root
        ET.SubElement(url_el, "lastmod").text = dt
        ET.SubElement(url_el, "changefreq").text = "weekly"
        ET.SubElement(url_el, "priority").text = "1.0"

    tree = ET.ElementTree(root)
    tree.write('sitemap.xml', encoding='utf-8', xml_declaration=True)
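The function above depends on an application-specific database query (db.result). A minimal self-contained sketch of the same idea, assuming a plain list of URL paths in place of the query (write_sitemap and the example paths are hypothetical names):

```python
import xml.etree.ElementTree as ET
import datetime

def write_sitemap(paths, filename="sitemap.xml"):
    # Build a sitemaps.org-style <urlset> with one <url> entry per path
    root = ET.Element("urlset")
    root.attrib["xmlns"] = "http://www.sitemaps.org/schemas/sitemap/0.9"
    today = datetime.date.today().isoformat()
    for path in paths:
        url_el = ET.SubElement(root, "url")
        ET.SubElement(url_el, "loc").text = "https://www.example.com/" + path
        ET.SubElement(url_el, "lastmod").text = today
        ET.SubElement(url_el, "changefreq").text = "weekly"
        ET.SubElement(url_el, "priority").text = "1.0"
    ET.ElementTree(root).write(filename, encoding="utf-8", xml_declaration=True)

write_sitemap(["page1", "page2"])
```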


Aditya Anand

For your first question, you can use urlparse to parse hostnames and check the domain. Do not use a hand-rolled regex for this; it is much easier with a standard library like that. See:

from urllib.parse import urlparse
parsed = urlparse(url)
hostname = parsed.hostname
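A short sketch of how that check might look in practice; note that relative hrefs like "/page1" need to be resolved against the base URL first (example.com and the sample links are placeholders):

```python
from urllib.parse import urlparse, urljoin

def is_internal(base, link):
    # Resolve relative links against the base page, then compare hostnames
    full = urljoin(base, link)
    return urlparse(full).hostname == urlparse(base).hostname

base = "http://example.com/"
links = ["/page1", "http://example.com/page2", "http://other.com/x"]
internal = [urljoin(base, l) for l in links if is_internal(base, l)]
```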

For your second question, your data structure looks like a graph, doesn't it? You could use a custom graph data structure with nodes and edges between them, or a graph database made for this purpose, but both of these solutions will be more complex than your needs call for. I think it might be better to just use a dictionary where the key is a page's URL and the value is the list of links found on that page. You can't follow links as easily as in a real graph, but it will still do the trick. You can also keep a separate set for tracking links you have already visited.

Anonta
tayfun