0

This code gets links from HTML web pages, but I want it to give me only the links that contain certain words. For instance, only links that have this in their URLs: "www.mywebsite.com/word"

My code:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.mywebsite.com')



for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        print link['href']
HavelTheGreat
theasker2000

2 Answers

2

You can use a simple substring test with the in operator. The example below prints only the links that have '/website-builder' in their href.

if '/website-builder' in link['href']:
    print link['href']

Full Code:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.mywebsite.com')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if '/website-builder' in link['href']:
            print link['href']

Sample Output:

/website-builder?linkOrigin=website-builder&linkId=hd.mainnav.mywebsite
/website-builder?linkOrigin=website-builder&linkId=hd.subnav.mywebsite.mywebsite
/website-builder?linkOrigin=website-builder&linkId=hd.subnav.hosting.mywebsite
/website-builder?linkOrigin=website-builder&linkId=ct.btn.stickynavigation.easy-to-use#easy-to-use
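The same substring filter can be sketched for Python 3 using only the standard library. Note that this swaps in html.parser for BeautifulSoup and is not the code from the answer above; the LinkCollector class, the filter_links helper, and the SAMPLE_HTML page are hypothetical names made up for illustration, and the sample page stands in for a live HTTP response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; skip <a> tags without href.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def filter_links(html, needle):
    """Return the hrefs found in html that contain needle as a substring."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if needle in href]


# Hypothetical sample page, used here instead of a live HTTP request.
SAMPLE_HTML = """
<a href="/website-builder?linkId=hd.mainnav">Builder</a>
<a href="/hosting">Hosting</a>
<a name="no-href">Anchor without href</a>
<a href="/website-builder#easy-to-use">Easy</a>
"""

for href in filter_links(SAMPLE_HTML, "/website-builder"):
    print(href)
```

Keeping the parsing and the filtering in separate steps means the needle can be changed without touching the parser.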
Vinod Sharma
  • Can you explain what auto is, and what you want to count in auto? – Vinod Sharma Mar 11 '15 at 20:35
  • Also, please accept the answer; I assume it's correct. – Vinod Sharma Mar 11 '15 at 20:38
  • Just set a variable count to zero before the for loop and increment it by one for every match found. – Vinod Sharma Mar 11 '15 at 20:41
  • No problem :). If you want to say thank you, then accept the answer :) If you don't know how to do that, here is a link that explains it: http://stackoverflow.com/help/someone-answers – Vinod Sharma Mar 11 '15 at 20:47
  • count = 0 for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')): count = count + 1 print 'Count: ', count (I put it before the for loop, and it doesn't give me any useful output.) – theasker2000 Mar 11 '15 at 21:14
  • What output did you get? I suggest you post a separate question that includes your code. This question is only about searching. – Vinod Sharma Mar 11 '15 at 22:44
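The counting suggestion from the comments (start a counter at zero before the loop and increment it on every match) might look like the sketch below. It is written for Python 3 and runs on a plain list of strings; the hrefs list and the count_matching helper are hypothetical stand-ins for the link['href'] values pulled from the page.

```python
def count_matching(hrefs, needle):
    """Count how many hrefs contain needle as a substring."""
    count = 0  # initialised once, before the loop
    for href in hrefs:
        if needle in href:
            count += 1  # incremented only when the href matches
    return count


# Hypothetical hrefs standing in for the values of link['href'].
hrefs = [
    "/website-builder?linkId=hd.mainnav",
    "/hosting",
    "/website-builder#easy-to-use",
]

# Printed once, after the loop has finished.
print("Count:", count_matching(hrefs, "/website-builder"))
```

The increment sits inside the if, so only matching links are counted rather than every link on the page.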
0

Here's what I came up with:

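# Tag.find() searches for child tags and never returns -1, so test the href string instead.
links = [link for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a'))
         if link.has_key('href') and "word" in link['href']]
print links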

Of course, you should replace "word" with whatever word you wish to filter by.
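One thing to watch when the filter string includes the host name, as in "www.mywebsite.com/word": pages often use relative hrefs such as "/word/page", which would not match. A possible approach, sketched here for Python 3 under that assumption, is to resolve each href against the page URL with urllib.parse.urljoin before testing it. The filter_absolute helper and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin


def filter_absolute(base_url, hrefs, needle):
    """Resolve each href against base_url, then keep those containing needle."""
    resolved = [urljoin(base_url, href) for href in hrefs]
    return [url for url in resolved if needle in url]


# Hypothetical hrefs as they might appear in a page at base_url.
base_url = "http://www.mywebsite.com"
hrefs = ["/word/page", "about", "http://other.example/word"]

for url in filter_absolute(base_url, hrefs, "www.mywebsite.com/word"):
    print(url)
```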

marcusshep