
I need to extract the .co.uk URLs from a file with lots of entries, some .com, .us, etc. I need only the .co.uk ones. Any way to do that? PS: I'm learning bash.

edit:

code sample:

<a href="http://www.mysite.co.uk/" target="_blank">32</a>
<tr><td id="Table_td" align="center"><a href="http://www.ultraguia.co.uk/motets.php?pg=2" target="_blank">23</a><a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>

Note that some URLs repeat.

Important: I need all the links, broken or 404 ones too.

I found this code somewhere on the net:

cat file.html | tr " " "\n" | grep .co.uk

output:

href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"

I think I'm close.
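
Maybe something like this is the missing piece (not sure about the options; I think -o needs GNU grep):

cat file.html | tr " " "\n" | grep -o 'http://[^"]*\.co\.uk[^"]*'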

Thanks!

user1478993
  • Welcome to Stack Overflow. Please improve your question by posting some [properly formatted](http://stackoverflow.com/editing-help) code, all **relevant** error messages exactly as they appear, and whatever samples you're testing against. – Todd A. Jacobs Jun 25 '12 at 03:51
  • Does `grep \\.co\\.uk ` do the job? If not, please specify what the format is of the file that you are trying to extract from, or post a relevant example snippet of that file. – Reinier Torenbeek Jun 25 '12 at 04:00
  • That prints the whole file and highlights .co.uk. I need to extract the full URL. – user1478993 Jun 25 '12 at 23:25
  • Any `grep`, `sed` or `awk`-like solution can be made to fail with specific HTML constructs, for example comments. How robust does your solution have to be? If your current solution is robust enough, you can clean it up by appending `| grep href | sed 's/.*href=\"\(.*\)\"/\1/'` – Reinier Torenbeek Jun 26 '12 at 13:10
  • So do you want to extract the full URLs, or just the domain names of your URLs? – Reinier Torenbeek Jun 26 '12 at 13:18

3 Answers


Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by denying wget the time to do its DNS lookup, it will not resolve anything and will just print the URLs. You can then grep for those URLs that have .co.uk in them. The whole story becomes:

wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"

If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
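
Putting the pieces together, the full pipeline would look something like this (a sketch of the same command with the sed cleanup appended; the sed expression simply strips everything up to wget's trailing "-- "):

wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*" | sed 's/.*-- //'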

If you do not have wget, then you can get it here.

Reinier Torenbeek

One way using awk:

awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html

output:

http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2

If you are only interested in unique URLs, pipe the output into sort -u.
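
For example, a sketch of the deduplicated command and the output it should give for the sample above:

awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html | sort -u

http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2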

HTH

Steve

The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable when faced with CDATA sections or other syntax that is hard to parse:

links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
  | tac \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk'

It works as follows:

  • links (a text-based web browser) actually retrieves the site.
    • Using -dump causes the rendered page to be emitted to stdout.
    • Using -html-numbered-links requests a numbered table of links.
    • Using -anonymous tweaks defaults for added security.
  • tac reverses the output from links line by line, so the table of links (which links prints at the bottom of the dump) comes first
  • sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
  • sed -e 's/[0-9]\+.[[:space:]]//' removes the numbered headings from the individual links.
  • grep '^https\?://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
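
Since the question is about a local file rather than a live site, the same pipeline should also work pointed directly at that file (a sketch assuming your build of links accepts a local path, with sort -u appended to drop the duplicates the question mentions):

links -dump yourFile.html -html-numbered-links 1 -anonymous \
  | tac \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk' \
  | sort -u
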
Charles Duffy