Filtering a variable so it only contains a specified string python

Question

I am trying to make a link crawler in Python. I know about HarvestMan, but that's not what I am looking for. Here is what I have so far:

import httplib, sys

target=sys.argv[1]
subsite=sys.argv[2]
link = "http://"+target+subsite

def spider():
    while 1:
        conn = httplib.HTTPConnection(target)
        conn.request("GET", subsite)
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        for x in data[:]:
            if link in x:
                print x
spider()

But I can't seem to find a way to filter x so that only the links themselves are left.

You may want to use an [html parser](http://docs.python.org/3/library/html.parser.html) to locate anchor tags and extract their `href` attributes. — grossvogel, Jun 17 '13 at 23:20
Can you give me an example of an HTML parser? Like most people, I am not used to working with HTML in Python XD — D4zk1tty, Jun 17 '13 at 23:24
@JonClements I know about Scrapy; I have been getting too many errors with it so far. — D4zk1tty, Jun 17 '13 at 23:24
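
As a rough illustration of the HTML parser suggestion in the comments above, here is a minimal sketch using the standard library's `html.parser` module. It assumes Python 3 (in Python 2 the module is named `HTMLParser`), and the names `LinkCollector` and `extract_links` are made up for this example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all anchor tags found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p><a href="http://example.com/user/Backbite">Backbite</a> <a name="top">no href</a></p>'
print(extract_links(sample))  # ['http://example.com/user/Backbite']
```

In the spider from the question, the whole response body (`r2.read()`, decoded to text) could be passed to `extract_links` instead of being split into lines first.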

score 2 · Answer 1 · answered Jun 17 '13 at 23:30

If you're going down that route, you can start by installing requests and bs4 (Beautiful Soup 4) to make life easier, and then build your own spider template based on:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.google.com')
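# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, 'html.parser')
# Find all anchor tags that have an href attribute
print([a['href'] for a in soup.find_all('a', href=True)])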
# ['http://www.google.co.uk/imghp?hl=en&tab=wi', 'http://maps.google.co.uk/maps?hl=en&tab=wl', 'https://play.google.com/?hl=en&tab=w8', 'http://www.youtube.com/?gl=GB&tab=w1', 'http://news.google.co.uk/nwshp?hl=en&tab=wn', 'https://mail.google.com/mail/?tab=wm', 'https://drive.google.com/?tab=wo', 'http://www.google.co.uk/intl/en/options/', 'http://www.google.co.uk/history/optout?hl=en', '/preferences?hl=en', 'https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.co.uk/', '/advanced_search?hl=en-GB&authuser=0', '/language_tools?hl=en-GB&authuser=0', 'https://www.google.com/intl/en_uk/chrome/browser/promo/cubeslam/', '/intl/en/ads/', '/services/', 'https://plus.google.com/103583604759580854844', '/intl/en/about.html', 'http://www.google.co.uk/setprefdomain?prefdom=US&sig=0_cYDPGyR7QbF1UxGCXNpHcrj09h4%3D', '/intl/en/policies/']
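
The list above mixes absolute URLs with relative ones such as `/preferences?hl=en`. If the goal is to keep only links that belong to the site being crawled, one possible approach is to resolve every href against the page URL and then filter by prefix. This is a sketch assuming Python 3; `links_under` is a hypothetical helper name, not part of requests or bs4:

```python
from urllib.parse import urljoin


def links_under(base_url, hrefs, prefix):
    """Resolve each href against base_url and keep those starting with prefix."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    return [url for url in absolute if url.startswith(prefix)]


hrefs = ["/preferences?hl=en", "https://mail.google.com/mail/?tab=wm", "/intl/en/ads/"]
print(links_under("http://www.google.com", hrefs, "http://www.google.com/"))
# ['http://www.google.com/preferences?hl=en', 'http://www.google.com/intl/en/ads/']
```

A plain prefix check is deliberately simple; it treats `http://` and `https://` versions of the same site as different, so a real crawler may want to compare hostnames instead.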

Can't import requests; anyway, I am sticking with Joran Beasley's answer. — D4zk1tty, Jun 17 '13 at 23:34

score 1 · Accepted Answer · answered Jun 17 '13 at 23:20

I think this would work:

import re
re.findall("href=([^ >]+)",x)

Joran Beasley

Not really working; I get the HTML tag stuck at the end, like this: `['"http://www.thisislegal.com/user/Backbite">` – D4zk1tty Jun 17 '13 at 23:23
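
The stray quote reported in the comment above comes from the pattern capturing everything after `href=`, including the quote characters. A possible variation, offered as a sketch rather than the answerer's own fix, matches the opening quote and captures only the text up to the matching closing quote. `HREF_RE` and `find_hrefs` are names invented for this example, and the pattern assumes the attribute value is quoted:

```python
import re

# Match href="..." or href='...' and capture only the text between the quotes.
# \1 refers back to whichever quote character opened the value.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def find_hrefs(line):
    """Return the quoted href values found in a line of HTML."""
    return [match.group(2) for match in HREF_RE.finditer(line)]


line = '<a href="http://www.thisislegal.com/user/Backbite">Backbite</a>'
print(find_hrefs(line))  # ['http://www.thisislegal.com/user/Backbite']
```

Regular expressions remain fragile on real-world HTML (unquoted attributes, attributes split across lines), which is why the comments on the question recommend an HTML parser.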
