Search term in url query-string
With this simple query...
xidel "https://www.google.com/search?q=xidel+follow+pagination" -e "$url"
https://consent.google.com/ml?continue=[...]
...you'll notice we're hitting a cookie wall. With -f "//form" Xidel can "click" on the consent button.
Extract the urls:
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/@href"
/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAYQAg&usg=AOvVaw2Yyh9OVSR_FLKehWApnFK2
/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAIQAg&usg=AOvVaw25MiKPwJB0jVHz2JTl5mBp
/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAgQAg&usg=AOvVaw3BfrZCAGHHs_nqpJ-1aj2u
[...]
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/resolve-uri(@href)"
https://www.google.com/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAkQAg&usg=AOvVaw2o5RqheOFbiQv-KFW7Jhxd
https://www.google.com/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAgQAg&usg=AOvVaw19rnj9nPwMX-zKVSNzacrw
https://www.google.com/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAcQAg&usg=AOvVaw3T4VVe92ucN0Jc7hzvAn8Y
[...]
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))"
{
"url": "https://www.google.com/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg&usg=AOvVaw1qftOzBqM1OfXkWkkJm0B8",
"protocol": "https",
"host": "www.google.com",
"path": "url",
"query": "q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg&usg=AOvVaw1qftOzBqM1OfXkWkkJm0B8",
"params": {
"q": "https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url",
"sa": "U",
"ved": "2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg",
"usg": "AOvVaw1qftOzBqM1OfXkWkkJm0B8"
}
}
[...]
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q"
https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url
https://stackoverflow.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
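For readers who don't have Xidel at hand, here is a rough Python sketch of what the resolve-uri() / request-decode() / params/q chain above does to a single href. It is not Xidel's implementation. The base URL is an assumption (the address of the results page), and redirect_target is a hypothetical helper name. The href is copied from the output above.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

# Assumed base URL of the results page (the document's base URI in Xidel terms).
BASE = "https://www.google.com/search?q=xidel+follow+pagination"

# Relative href as printed by the first extraction command above.
HREF = ("/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall"
        "&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAIQAg"
        "&usg=AOvVaw25MiKPwJB0jVHz2JTl5mBp")


def redirect_target(base: str, href: str) -> str:
    """Hypothetical helper: return the percent-decoded 'q' parameter of a /url?q=... link."""
    absolute = urljoin(base, href)      # roughly what resolve-uri(@href) does
    query = urlsplit(absolute).query    # the "query" field of the decoded request
    return parse_qs(query)["q"][0]      # the "params/q" value, percent-decoded


print(redirect_target(BASE, HREF))
# https://stackoverflow.com/tags/xidel/hot?filter=all
```

Note how %3F and %3D in the href come back as ? and =, which matches the second url in the Xidel output.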
Follow pagination:
The final command above extracts the urls from the first results page. To include the urls from the other results pages, you can do a "recursive follow":
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q" ^
-f "//a[@aria-label and contains(.,'>')]"
-f "//a[@aria-label and contains(.,'>')]" "clicks" the next-page button until there are no more.
Note the warning by Xidel's author, though: "!!! Recursive follow is deprecated and might be removed soon. !!!"
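In case the "recursive follow" idea is unclear, this is the general pattern in a small Python sketch: extract from the current page, look for a next-page link, and repeat until there is none. It says nothing about how Xidel implements it. follow_pagination, the fetch callback, FAKE_PAGES and the max_pages safety cap are all made up for illustration, and the pages are faked so the example runs without network access.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is modelled as (results, next_url); next_url is None on the last page.
Page = Tuple[List[str], Optional[str]]


def follow_pagination(start_url: str,
                      fetch: Callable[[str], Page],
                      max_pages: int = 50) -> Iterator[str]:
    """Yield results page by page, following the next-page link until it is missing."""
    url: Optional[str] = start_url
    seen = set()
    # Stop when there is no next link, when a page repeats, or at the page cap.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        results, url = fetch(url)
        yield from results


# Hypothetical stand-in for three results pages.
FAKE_PAGES = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c"], "page3"),
    "page3": (["d"], None),
}

print(list(follow_pagination("page1", FAKE_PAGES.__getitem__)))
# ['a', 'b', 'c', 'd']
```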
Search term through form()
A better alternative would be to visit the homepage and submit the search term through form(). A user-agent is needed, but the cookie-consent button is automatically "clicked" and the HTML source is easier to parse.
Extract the urls:
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href"
https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url
https://stackoverflow.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
Follow pagination:
This can be done by yet another "recursive follow":
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href" ^
-f "//a[@id='pnnext']/@href"
Changing the form() parameters, however, is a lot easier in this case:
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination','num':'100'})" -e "//div[@class='yuRUbf']/a/@href"
I don't know if num has a hard limit or not, but 100 seems to work at least.
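To see how the two extra form() parameters relate to the query-string approach from the first section, here is a small Python sketch that encodes the same key/value pairs into a url. It assumes the search form is submitted as a GET request to /search, which I haven't verified. The real request made by form() will most likely carry additional hidden form fields that are not shown here.

```python
from urllib.parse import urlencode

# The same key/value pairs that are passed to form() above.
params = {"q": "xidel follow pagination", "num": "100"}

# urlencode() turns spaces into "+" and joins the pairs with "&".
query = urlencode(params)

# Assumption: the form is sent as a GET request to /search.
print("https://www.google.com/search?" + query)
# https://www.google.com/search?q=xidel+follow+pagination&num=100
```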