I am running a WordPress site on an Ubuntu 20.04 based LEMP server. I have the PageSpeed plugin enabled, and to force it to cache my website I am using wget from a second box to mirror the site. However, wget stops downloading after the first page (index.html) with the error:

nofollow attribute found in /tmp/ramdisk/www.example.com/index.html. Will not follow any links on this page

Below is the wget command I am using and its output:

wget -m -p -E -k -P /tmp/ramdisk/ https://www.example.com
--2022-05-17 16:41:40--  https://www.example.com/
Resolving www.example.com (www.example.com)... 1**.2*.1**.*
Connecting to www.example.com (www.example.com)|1**.2*.1**.*|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/tmp/ramdisk/www.example.com/index.html’

www.example.com/index.html                                     [   <=>                                                                                                                       ] 130.71K   210KB/s    in 0.6s

Last-modified header missing -- time-stamps turned off.
2022-05-17 16:41:42 (210 KB/s) - ‘/tmp/ramdisk/www.example.com/index.html’ saved [133848]

nofollow attribute found in /tmp/ramdisk/www.example.com/index.html. Will not follow any links on this page
FINISHED --2022-05-17 16:41:42--
Total wall clock time: 2.0s
Downloaded: 1 files, 131K in 0.6s (210 KB/s)
Converting links in /tmp/ramdisk/www.example.com/index.html... 135.
42-93
Converted links in 1 files in 0.004 seconds.

How can I go about finding the nofollow attributes and removing them so wget will fully download my website?

DanRan

2 Answers

As documented in the wget manual, you can tell wget to ignore the nofollow attribute by adding the option -e robots=off to the command.
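As a rough sketch, this is the mirror command from the question with that option added. The wrapper function name mirror_site is only for illustration and is not part of wget; the URL and target directory are the ones from the question.

```shell
# Sketch: the question's mirror command with robot exclusion turned off.
# mirror_site is a hypothetical helper name, not a wget feature.
mirror_site() {
    # $1 = URL to mirror, $2 = directory to save into
    # -e robots=off tells wget not to honor robots.txt or the
    # robots "nofollow" meta tag, so it keeps following links.
    wget -e robots=off -m -p -E -k -P "$2" "$1"
}

# Usage (not run here):
# mirror_site https://www.example.com /tmp/ramdisk/
```

This only changes how wget behaves on the client side; the pages served by the site still carry the nofollow tag.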

Gerald Schneider

I figured this out.

I had to log into my WordPress installation via the web interface, go to Settings > Reading > Search engine visibility, and uncheck this option:

Discourage search engines from indexing this site (It is up to search engines to honor this request.)

After I unchecked that, I could successfully mirror my site using the original command, wget -m -p -E -k -P /tmp/ramdisk/ https://www.example.com.

See the screenshot below for more info.

[Screenshot: WordPress - Search Engine Visibility - Discourage search engines from indexing this site]
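To check whether a saved page still carries the tag that triggers the wget message, something like the following could be used. It is a minimal sketch: the function name has_nofollow is made up for illustration, and the pattern assumes the tag is written as a robots meta element with the name attribute before the content attribute, which may not match every theme or plugin.

```shell
# Sketch: succeed if the given HTML file contains a robots meta tag
# whose attributes include "nofollow". has_nofollow is a hypothetical name.
has_nofollow() {
    # -i: ignore case, -q: no output (exit status only), -E: extended regex
    # Assumes name=... appears before content=... inside the tag.
    grep -iqE '<meta[^>]*name=.robots.[^>]*nofollow' "$1"
}

# Usage, with the path from the question (not run here):
# has_nofollow /tmp/ramdisk/www.example.com/index.html && echo "nofollow still present"
```

Running it against the downloaded index.html before and after changing the setting shows whether the tag has actually gone.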

DanRan