I'm trying to parse a webpage and create a site map using Python. I've written the code below:
import urllib2
from bs4 import BeautifulSoup
mypage = "http://example.com/"
page = urllib2.urlopen(mypage)
soup = BeautifulSoup(page,'html.parser')
all_links = soup.find_all('a')
for link in all_links:
    print link.get('href')
The above code prints all the links on example.com (external and internal).
- I need to filter out the external links and print only the internal ones. I know I can tell them apart by domain name ("example.com" versus "somethingelse.com" or whatever the name is), but I can't work out the regular expression for this. Is there a built-in library that helps achieve this?
- Once I get all the internal links, how do I map them? For instance, "example.com" has a link to "example.com/page1", which has a link to "example.com/page3". What is the ideal way to create a map for this kind of flow? I'm looking for a library or logic which shows "example.com" -> "example.com/page1" -> "example.com/page3" or something similar.
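To make the two points above concrete, here is a rough sketch of the kind of logic being asked about. It is not a tested crawler, just one possible shape under some assumptions: it compares host names with the standard library's URL parser instead of a regex, it treats "internal" as "same host as the page" (so "www.example.com" and "example.com" would count as different sites), and the names `internal_links`, `build_site_map`, `fetch_hrefs` and `fake_pages` are made up for illustration. The page contents are faked with a dictionary so the example runs without network access.

```python
from collections import deque

try:
    from urllib.parse import urljoin, urldefrag, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urldefrag, urlparse  # Python 2


def internal_links(page_url, hrefs):
    """Return absolute URLs from hrefs that are on the same host as page_url."""
    host = urlparse(page_url).netloc
    result = []
    for href in hrefs:
        if not href:  # link.get('href') gives None for <a> tags without href
            continue
        # Resolve relative links against the page and drop any #fragment.
        absolute = urldefrag(urljoin(page_url, href))[0]
        if urlparse(absolute).netloc == host and absolute not in result:
            result.append(absolute)
    return result


def build_site_map(start_url, fetch_hrefs, max_pages=100):
    """Breadth-first crawl; returns {page_url: [internal links found on it]}.

    fetch_hrefs is a hypothetical callable: given a URL, it returns the
    href values found on that page.
    """
    site_map = {}
    queue = deque([start_url])
    while queue and len(site_map) < max_pages:
        url = queue.popleft()
        if url in site_map:  # already visited
            continue
        site_map[url] = internal_links(url, fetch_hrefs(url))
        queue.extend(link for link in site_map[url] if link not in site_map)
    return site_map


# Made-up page contents standing in for real HTTP requests.
fake_pages = {
    "http://example.com/": ["/page1", "http://somethingelse.com/", None],
    "http://example.com/page1": ["page3", "http://example.com/"],
    "http://example.com/page3": [],
}

site_map = build_site_map("http://example.com/", lambda url: fake_pages.get(url, []))

for page, links in site_map.items():
    for link in links:
        print("%s -> %s" % (page, link))
```

In a real run, `fetch_hrefs` could wrap the code from the question: open the URL with `urllib2.urlopen`, parse it with `BeautifulSoup`, and return `link.get('href')` for every `a` tag. The resulting dictionary is a plain adjacency list, so it could be printed as arrows as above or handed to a graph library if a picture is wanted.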