
I need to download all of the page links from http://en.wikipedia.org/wiki/Meme and save them to a file, all with one command.

This is my first time using the command line, so I'm unsure of the exact commands, flags, etc. to use. I only have a general idea of what to do, and had to search around to find out what href means.

wget http://en.wikipedia.org/wiki/Meme -O links.txt | grep 'href=".*"' | sed -e 's/^.*href=".*".*$/\1/'

The output of the links in the file does not need to be in any specific format.

cajole0110

3 Answers


Using GNU grep (-P enables Perl-compatible regular expressions, which provide the (?<=href=") lookbehind; -o prints only the matched parts of each line):

grep -Po '(?<=href=")[^"]*' links.txt

or with wget:

wget http://en.wikipedia.org/wiki/Meme -q -O - | grep -Po '(?<=href=")[^"]*'
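Since the question asks for the links in a file, redirecting the output works too; for example (-q suppresses wget's progress output and -O - sends the page to stdout):

wget http://en.wikipedia.org/wiki/Meme -q -O - | grep -Po '(?<=href=")[^"]*' > links.txt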
BMW
You may also want to add the `-q` flag, to prevent printing the progress bar interleaved with the actual output (the progress bar is printed to stderr, so it doesn't interfere as such, it just looks funky). – Martin Tournoij Feb 19 '14 at 00:17

You could use wget's spider mode. See this SO answer for an example.
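For reference, that approach looks roughly like this (a sketch, not the linked answer verbatim: --spider -r makes wget crawl the linked pages rather than parse one page's HTML, and the visited URLs are pulled from wget's log on stderr):

wget --spider --force-html -r -l1 http://en.wikipedia.org/wiki/Meme 2>&1 | grep '^--' | awk '{ print $3 }' > links.txt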


Ken
wget http://en.wikipedia.org/wiki/Meme -q -O - | sed -n 's/.*href="\([^"]*\)".*/\1/p' > links.txt

but this only takes one href per line; if a line contains more than one, the others are lost (the same problem as your original command). You also forgot to include a capture group (\( ... \)) in your original sed pattern, so its \1 refers to nothing.
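One way around the one-per-line limitation (a sketch, assuming GNU sed, which accepts \n in the replacement text) is to break the page apart so every href starts its own line before extracting:

wget http://en.wikipedia.org/wiki/Meme -q -O - | sed 's/href="/\n&/g' | sed -n 's/^href="\([^"]*\)".*/\1/p' > links.txt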

NeronLeVelu