Getting a list of URLs with wget using regex

Asked Apr 21 '18 at 19:20

Active Apr 22 '18 at 16:48

Viewed 277 times

I'm starting with page:

https://mysite/a

I'd like to spider the page and get the full URLs of any nested pages below it that begin with the same stem (like https://mysite/a/b).

I've tried:

$ wget -r --spider --accept-regex "https://...*" 'https://.../' 2>test.txt

which produces a large amount of output, including what appear to be the URLs I'm after, like:

--2018-04-21 15:04:48--  https://mysite/a/
Reusing existing connection to mysite:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'a/index.html.tmp.tmp'

How do I print just a list of the URLs?

Edit:

changed it to

$ wget -r --spider  'https://mysite/a/' |grep 'https://mysite/a*' 2>test.txt

as a test. No output is being saved in test.txt; the file is empty.
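A likely explanation for the empty file: wget writes its log to stderr, the pipe carries only stdout, and the trailing `2>test.txt` redirects grep's stderr rather than its matches. The sketch below reproduces this without touching the network. `fake_wget` is a hypothetical stand-in that prints wget-style lines to stderr, and `urls.txt` is an arbitrary file name.

```shell
# Hypothetical stand-in for wget: like wget's log, it writes to stderr only.
fake_wget() {
    printf '%s\n' '--2018-04-21 15:04:48--  https://mysite/a/' >&2
    printf '%s\n' 'HTTP request sent, awaiting response... 200 OK' >&2
    printf '%s\n' '--2018-04-21 15:04:49--  https://mysite/a/b' >&2
}

# Shape of the attempt above: nothing on stdout, so grep receives no lines.
# (2>/dev/null only keeps the demo quiet; '|| true' absorbs grep's exit status 1.)
seen=$(fake_wget 2>/dev/null | grep -c 'https://mysite/a' || true)
echo "lines reaching grep: $seen"

# Merge stderr into stdout before the pipe, then redirect grep's stdout to the file.
fake_wget 2>&1 | grep -o 'https://mysite/a[^ ]*' > urls.txt
cat urls.txt
```

With the real command, the equivalent shape would be `wget -r --spider 'https://mysite/a/' 2>&1 | grep -o 'https://mysite/a[^ ]*' > test.txt`. This has not been run against the actual site.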

edited Apr 22 '18 at 16:48

user1592380

Try grepping the output first, something like: `wget -r --spider -qO yourURL | grep "https:.*" 2> test.txt ` – builder-7000 Apr 21 '18 at 21:50
@Sergio please see edit. – user1592380 Apr 22 '18 at 12:24
Your regex `https://mysite/a*` will match URLs like `https://mysite/` or `https://mysite/aaaaa`, etc. Perhaps a real-world example will be more helpful. You can try testing a regex with your actual page content at https://regex101.com/ if you like. I'd suggest you try something fairly generic like `grep -oE 'https?://[^" ]+'` (the `-E` is needed for `?` and `+` to act as operators rather than literal characters). – ghoti Apr 22 '18 at 16:54
How do you send the output to the file? `2>test.txt` doesn't produce anything in the file. – user1592380 Apr 23 '18 at 02:16
from https://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only : `wget -q http://example.com -O - | tr "\t\r\n'" '   "' | grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | sed -e 's/^.*"\([^"]\+\)".*$/\1/g'` – user1592380 Apr 24 '18 at 00:49
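To illustrate the point about `a*` raised in the comments, here is a small sketch run against made-up log lines (the `log` variable and its URLs are assumptions, not real output from the site). It compares the loose pattern with one anchored on the stem plus a slash, and uses `grep -o` to print only the matched URL.

```shell
# Sample lines standing in for wget log output (made up for illustration).
log='--2018-04-21 15:04:48--  https://mysite/a/
--2018-04-21 15:04:49--  https://mysite/other
--2018-04-21 15:04:50--  https://mysite/a/b
--2018-04-21 15:04:51--  https://mysite/a/b'

# 'a*' means "zero or more a", so every URL on the host matches, including /other.
loose=$(printf '%s\n' "$log" | grep -c 'https://mysite/a*')

# Anchor on the stem plus a slash, print only the match (-o), and de-duplicate.
urls=$(printf '%s\n' "$log" | grep -oE 'https://mysite/a/[^" ]*' | sort -u)

echo "lines matched by a*: $loose"
printf '%s\n' "$urls"
```

The anchored pattern keeps `https://mysite/a/` and `https://mysite/a/b` and drops `https://mysite/other`; `sort -u` collapses the repeated entry.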