Is it possible to use wget's mirror mode to collect all the links from an entire website and save them to a txt file?

If it's possible, how is it done? If not, are there other methods to do this?

EDIT:

I tried to run this:

wget -r --spider example.com

And got this result:

Spider mode enabled. Check if remote file exists.
--2015-10-03 21:11:54--  http://example.com/
Resolving example.com... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
Connecting to example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2015-10-03 21:11:54--  http://example.com/
Reusing existing connection to example.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Saving to: 'example.com/index.html'

100%[=====================================================================================================>] 1,270       --.-K/s   in 0s      

2015-10-03 21:11:54 (93.2 MB/s) - 'example.com/index.html' saved [1270/1270]

Removing example.com/index.html.

Found no broken links.

FINISHED --2015-10-03 21:11:54--
Total wall clock time: 0.3s
Downloaded: 1 files, 1.2K in 0s (93.2 MB/s)

(Yes, I also tried using other websites with more internal links)
user1878980
  • Yes, that's how it should work. The actual site "example.com" has no internal links, so it just returns itself. Try a site that has links to other pages within the site and you should get more. Did you also want to get links to *external* sites? If so, the python script from @Randomazer is probably a better bet. – seumasmac Oct 03 '15 at 20:48
  • Actually, there is a similar question to yours at: http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only which may be of use. – seumasmac Oct 03 '15 at 20:51
  • Thanks a lot! That helps! – user1878980 Oct 03 '15 at 21:33

2 Answers

Yes, using wget's --spider option. A command like:

wget -r --spider example.com

will get all links down to a depth of 5 (the default). You can then capture the output to a file, perhaps cleaning it up as you go. Something like:

wget -r --spider example.com 2>&1 | grep "http://" | cut -f 4 -d " " >> weblinks.txt

will put just the links into the weblinks.txt file (if your version of wget has slightly different output, you may need to tweak that command a little).
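If the cut field doesn't line up (the position of the URL in wget's output varies a bit between versions and locales), a slightly more defensive sketch is to grab anything that looks like a URL and de-duplicate it, assuming a grep that supports -E and -o (GNU grep does):

wget -r --spider example.com 2>&1 | grep -Eo 'https?://[^ ]+' | sort -u > weblinks.txt

This also catches https links and writes each unique URL once, though it may pick up the odd URL from wget's own status lines.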

seumasmac
  • Alright, thanks. I tried to copy the script you wrote, but it didn't work out of the box. It created a weblinks.txt file, but in this case it only saved http://www.example.com to the .txt file (I tried other websites as well). Maybe I need to tweak it, but the problem is I don't know how. – user1878980 Oct 03 '15 at 18:53
  • Can you just run the first command and see what output it gives? Note that the only way it can figure out what other pages are there is by following the links on the page you give it. If there aren't any links to other pages, it won't find anything else. – seumasmac Oct 03 '15 at 18:54
  • It's difficult to add details in these comments, so you may find it easier to update your Question with details of what you've tried. – seumasmac Oct 03 '15 at 19:14

Or using Python, for example:

import re
import urllib

def do_page(url):
    # Python 2's urllib.urlopen; on Python 3 use urllib.request.urlopen instead
    f = urllib.urlopen(url)
    html = f.read()
    # Match single-quoted links that start with the site URL and end in .html
    # (non-greedy, with the dots in the URL escaped)
    pattern = r"'{}.*?\.html'".format(re.escape(url))
    hits = re.findall(pattern, html)
    return hits

if __name__ == '__main__':
    hits = []
    url = 'http://thehackernews.com/'
    hits.extend(do_page(url))
    with open('links.txt', 'w') as f1:
        for hit in hits:
            f1.write(hit + '\n')  # one link per line

Out:

'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/adblock-extension.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/data-breach-hacking.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/p/authors.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/09/digital-india-facebook.html'
'http://thehackernews.com/2015/09/digital-india-facebook.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/10/buy-google-domain.html'
'http://thehackernews.com/2015/09/winrar-vulnerability.html'
'http://thehackernews.com/2015/09/winrar-vulnerability.html'
'http://thehackernews.com/2015/09/chip-mini-computer.html'
'http://thehackernews.com/2015/09/chip-mini-computer.html'
'http://thehackernews.com/2015/09/edward-snowden-twitter.html'
'http://thehackernews.com/2015/09/edward-snowden-twitter.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
'http://thehackernews.com/2015/09/quantum-teleportation-data.html'
'http://thehackernews.com/2015/09/quantum-teleportation-data.html'
'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html'
'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html'
'http://thehackernews.com/2015/09/xor-ddos-attack.html'
'http://thehackernews.com/2015/09/xor-ddos-attack.html'
'http://thehackernews.com/2015/09/truecrypt-encryption-software.html'
'http://thehackernews.com/2015/09/truecrypt-encryption-software.html'
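
Note that the list contains a lot of duplicates; if you only want each link once, one option is to de-duplicate links.txt afterwards on the shell, for example (unique-links.txt is just an example output name):

sort -u links.txt > unique-links.txt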
Randomazer