
I have a small project at home, where I need to scrape a website for links every once in a while and save the links in a txt file.

The script needs to run on my Synology NAS, so it has to be written in Bash or Python without any plugins or external libraries, as I can't install them on the NAS (to my knowledge, anyhow).

A link looks like this:

<a href="http://www.example.com">Example text</a>

I want to save the following to my text file:

Example text - http://www.example.com

I was thinking I could isolate the text with curl and some grep (or perhaps a regex). At first I looked into using Scrapy or BeautifulSoup, but I couldn't find a way to install them on the NAS.

Could one of you help me put a script together?

RazziaDK

A typical web-page may contain many "http..." strings which are **NOT** links, and I'm pretty sure that you would not want to scrape those off the website. You probably want to find all the `<a>` tags, and get the links from those elements only. Can you please provide the URL of the web-page that you want to scrape? – barak manos Jan 25 '14 at 20:58

3 Answers

You can use urllib2, which ships with Python. With it you can easily get the HTML of any URL:

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

Now, about parsing the HTML: you can still use BeautifulSoup without installing it. Their site says "You can also download the tarball and use BeautifulSoup.py in your project directly". So search the internet for that BeautifulSoup.py file. If you can't find it, then download this one and save it into a local file inside your project. Then use it like below:

from BeautifulSoup import BeautifulSoup  # the BeautifulSoup.py you saved locally

soup = BeautifulSoup(html)
for link in soup("a"):
    print link["href"]
    print link.renderContents()
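
To get the exact "Example text - http://www.example.com" format from the question into a text file, a rough, untested sketch along these lines should do it (the URL and the links.txt filename are placeholders, and it assumes BeautifulSoup.py sits next to the script):

import urllib2
from BeautifulSoup import BeautifulSoup  # the locally saved BeautifulSoup.py

# Fetch the page (placeholder URL) and parse it
html = urllib2.urlopen('http://www.example.com/').read()
soup = BeautifulSoup(html)

# Write one "link text - href" line per <a> tag that has an href
out = open('links.txt', 'w')
for link in soup('a'):
    href = dict(link.attrs).get('href')  # attrs is a list of (name, value) pairs
    if href:
        out.write('%s - %s\n' % (link.renderContents().strip(), href))
out.close()
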
Sabuj Hassan

I recommend using Python's htmlparser library. It will parse the page into a hierarchy of objects for you. You can then find the a href tags.

http://docs.python.org/2/library/htmlparser.html

There are lots of examples of using this library to find links, so I won't list all of the code, but here is a working example: Extract absolute links from a page using HTMLParser
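
For reference, here is a rough sketch of that approach (untested; Python 2 module names and print syntax, and it assumes the link text sits directly inside the <a> tag with no nested markup):

from HTMLParser import HTMLParser

class LinkParser(HTMLParser):
    # Collects (text, href) pairs for every anchor tag that has an href
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None

parser = LinkParser()
parser.feed('<a href="http://www.example.com">Example text</a>')  # the example from the question
for text, href in parser.links:
    print '%s - %s' % (text, href)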

EDIT:

As Oday pointed out, htmlparser is an external library and you may not be able to load it. In that case, here are two recommendations for built-in modules that can do what you need:

  • htmllib is included in Python 2.X (a short sketch follows this list).

  • xml is included in Python 2.X and 3.X.
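
A short, untested sketch of the htmllib route (Python 2 only; it collects the href values but not the anchor text):

import htmllib, formatter, urllib2

# htmllib.HTMLParser records every href it encounters in its anchorlist attribute
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(urllib2.urlopen('http://www.example.com/').read())  # placeholder URL
parser.close()

for href in parser.anchorlist:
    print href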

There is also a good explanation elsewhere on this site for how to use wget & grep to do the same thing:
Spider a Website and Return URLs Only

Kevin

Based on your example, you need something like this:

wget -q -O- https://dl.dropboxusercontent.com/s/wm6mt2ew0nnqdu6/links.html?dl=1 | sed -r 's#<a href="([^"]+)">([^<]+)</a>.*$#\2 - \1#' > links.txt

cat links.txt outputs:

Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Pedro Lobito
  • Doesn't work. `sed: illegal option -- r usage: sed script [-Ealn] [-i extension] [file ...] sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]` – Ahmad Awais Aug 05 '16 at 15:52