
I have a text file, mylinks.txt, containing thousands of hyperlinks, one per line, in the format "URL = http://examplelink.com".
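For illustration, the lines look like this (made-up examples, not my real data):

URL = http://examplelink.com/article-2018
URL = http://examplelink.com/other-article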

What I want to do is go through all of these links and check whether any of them contain certain keywords, like "2018" or "2017". If a link contains a keyword, I want to save it to the file "yes.txt"; if it doesn't, it goes to the file "no.txt".

So at the end, I would end up with two files: one with the links that send me to pages containing the keywords I'm searching for, and another with the links that don't.

I was thinking about doing this with curl, but I don't even know if it's possible, and I also don't know how to "filter" the links by keywords.

What I've got so far is:

curl -K mylinks.txt >> output.txt

But this only creates one huge file with the HTML of every link it fetches (with -K, curl reads the file as a config file and downloads each url = ... entry). I've searched and read through various curl tutorials and haven't found anything that "selectively" searches pages and saves the links (not the content) of the pages matching the criteria.

Heitorado
  • *checks if any of them contains some keywords* - you mean `href` value or text value of a link? – RomanPerekhrest Feb 04 '18 at 22:49
  • Please provide a sample of a few lines of `mylinks.txt`. – NVRM Feb 04 '18 at 22:51
  • Learn to script in Python: more yield and control. It's probably installed by default on your Linux/Unix machine. In addition, add the sample requested by Cryptopat. That gives the question more body and makes it more likely to survive and meet SO's minimal rules for posting questions. Check also the grey-circled question mark for MCVE and more. End of review. – ZF007 Feb 04 '18 at 22:51

2 Answers


Untested: this extracts the URLs from lines containing "2017" or "2018".

cat mylinks.txt | grep -E '2017|2018' | grep -o 'http[^ ]*' >> yes.txt

To get the URLs from lines that don't contain the keywords:

cat mylinks.txt | grep -vE '2017|2018' | grep -o 'http[^ ]*' >> no.txt

This is Unix piping: the | character takes the stdout of the program on its left and feeds it to the stdin of the program on its right.

In Unix-like computer operating systems, a pipeline is a sequence of processes chained together by their standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one. https://en.wikipedia.org/wiki/Pipeline_(Unix)
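For reference, both files can also be produced in a single pass with awk (an untested sketch, assuming every line has the form "URL = http://..." so the URL is the third space-separated field):

awk '/2017|2018/ { print $3 >> "yes.txt"; next } { print $3 >> "no.txt" }' mylinks.txt

Lines matching either keyword go to yes.txt; everything else goes to no.txt.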

NVRM
  • *and if it doesn't it goes to the file "**no.txt**".* *So at the end, I would end up with two files: one with the links that send me to pages with the keywords I'm searching for, and other one with the links that doesn't.* – RomanPerekhrest Feb 04 '18 at 22:42

Here is my take on it (kind of tested on a URL file with a few examples). This is supposed to be saved as a script; it's too long to type into the console directly.

#!/bin/bash
urlFile="/path/to/myLinks.txt"
# strip the "URL = " prefix, keeping only the third field (the URL itself)
cut -d' ' -f3 "$urlFile" | \
while read -r url
do
  echo "checking url $url"
  # fetch the page quietly; grep -q reports a match via its exit status only
  if curl -s "$url" | grep -q "2017"
  then
    echo "$url" >> /tmp/yes.txt
  else
    echo "$url" >> /tmp/no.txt
  fi
done

Explanation: the cut is necessary to cut away the "URL = " prefix in each line. The URLs are then fed into the while-read loop. For each URL, we curl it and grep for the interesting keyword (in this case "2017"); if grep exits with status 0, meaning the keyword was found, we append the URL to the file with the interesting URLs.

Obviously, you should adjust the paths and the keyword.
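If you want to test several keywords at once, grep accepts multiple -e patterns; a sketch of the changed test line (the rest of the loop stays the same):

if curl -s "$url" | grep -q -e "2017" -e "2018"

With -e, the URL counts as a "yes" if the page contains any of the listed keywords.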

Andrey Tyukin
  • Thanks for the reply! The solution would indeed work, but the links I have weren't working with curl for some reason. I got what I needed by using wget to save all the HTML pages of the links and then grepping the entire folder of saved pages, writing the file names to the 'yes' or 'no' txt files. – Heitorado Feb 20 '18 at 12:56
  • I've tested it on a few pages that didn't require any log-ins, and it worked with `curl`. You might want to take a look at [this question](https://stackoverflow.com/questions/20986395/why-does-curl-not-work-but-wget-works?rq=1) if you want to understand why it works with wget but not with curl. – Andrey Tyukin Feb 20 '18 at 12:59