
I need to extract the .co.uk URLs from a file with lots of entries, some .com, .us, etc. I need only the .co.uk ones. Any way to do that? PS: I'm learning bash.

edit:

code sample:

<a href="http://www.mysite.co.uk/" target="_blank">32</a>
<tr><td id="Table_td" align="center"><a href="http://www.ultraguia.co.uk/motets.php?pg=2" target="_blank">23</a><a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>

Note that some URLs repeat.

Important: I need all the links, broken or 404 ones too.

I found this code somewhere on the net:

cat file.html | tr " " "\n" | grep .co.uk

output:

href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"

I think I'm close.
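
Maybe something like this is the missing piece (not sure about the options; I think -o needs GNU grep):

cat file.html | tr " " "\n" | grep -o 'http://[^"]*\.co\.uk[^"]*'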

Thanks!

user1478993
  • Welcome to Stack Overflow. Please improve your question by posting some [properly formatted](http://stackoverflow.com/editing-help) code, all **relevant** error messages exactly as they appear, and whatever samples you're testing against. – Todd A. Jacobs Jun 25 '12 at 03:51
  • Does `grep \\.co\\.uk ` do the job? If not, please specify what the format is of the file that you are trying to extract from, or post a relevant example snippet of that file. – Reinier Torenbeek Jun 25 '12 at 04:00
  • That prints the whole file and highlights .co.uk. I need to extract the full URL. – user1478993 Jun 25 '12 at 23:25
  • Any `grep`, `sed` or `awk`-like solution can be made to fail with specific HTML constructs, for example comments. How robust does your solution have to be? If your current solution is robust enough, you can clean it up by appending `| grep href | sed 's/.*href=\"\(.*\)\"/\1/'` – Reinier Torenbeek Jun 26 '12 at 13:10
  • So do you want to extract the full URLs, or just the domain names of your URLs? – Reinier Torenbeek Jun 26 '12 at 13:18

3 Answers


Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by denying wget the time to do its DNS lookup, it will not resolve anything and will just print the URLs. You can then grep for those URLs that have .co.uk in them. The whole story becomes:

wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"

If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
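
Putting the pieces together, the full pipeline would look something like this (a sketch of the same command with the sed cleanup appended; the sed expression simply strips everything up to wget's trailing "-- "):

wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*" | sed 's/.*-- //'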

If you do not have wget, then you can get it here.

Reinier Torenbeek

One way using awk:

awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html

output:

http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2

If you are only interested in unique URLs, pipe the output into sort -u.
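
For example, a sketch of the deduplicated command and the output it should give for the sample above:

awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html | sort -u

http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2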

HTH

Steve

The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable when faced with CDATA sections or other syntax that is hard to parse:

links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
  | tac \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk'

It works as follows:

  • links (a text-based web browser) actually retrieves the site.
    • Using -dump causes the rendered page to be emitted to stdout.
    • Using -html-numbered-links requests a numbered table of links.
    • Using -anonymous tweaks defaults for added security.
  • tac reverses the output from links line by line, so the table of links (which links prints at the bottom of the dump) comes first
  • sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
  • sed -e 's/[0-9]\+.[[:space:]]//' removes the numbered headings from the individual links.
  • grep '^https\?://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
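
Since the question is about a local file rather than a live site, the same pipeline should also work pointed directly at that file (a sketch assuming your build of links accepts a local path, with sort -u appended to drop the duplicates the question mentions):

links -dump yourFile.html -html-numbered-links 1 -anonymous \
  | tac \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk' \
  | sort -u
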
Charles Duffy