
I have a hosted WordPress website (not on AWS). Instead of storing large files (i.e. audio and video) on the web server, I use HTML href links that point to where the files are stored in an S3 bucket. For example, if someone wants to listen to (or download) a recording, the website HTML contains something like this:

<a href="https://cdnxxxxxx.s3.eu-west-1.amazonaws.com/name_of_audio_file.mp3">Listen to NAME of AUDIO FILE</a>

My issue is that the bucket policy was originally public read-only (i.e. Principal "*", s3:GetObject), but web scrapers and others are following the links and scraping the data (I currently monitor and analyse the S3 access logs). I have amended the policy to the following, which enforces TLS 1.2 or higher and (I hope) denies a specific list of scrapers:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObjectIF-SSL>1.1",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::cdnxxxxxx/*",
            "Condition": {
                "NumericGreaterThan": {
                    "s3:TlsVersion": "1.1"
                }
            }
        },
        {
            "Sid": "UserAgentDenyGetObject",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::cdnxxxxxx/*",
            "Condition": {
                "StringLike": {
                    "aws:UserAgent": [
                        "Baiduspider",
                        "AhrefsBot",
                        "Semrush",
                        "yandex",
                        "2ip",
                        "ALittle",
                        "ZoominfoBot",
                        "cpp-httplib",
                        "Expanse",
                        "8LEGS",
                        "coccocbot",
                        "Pandalytics"
                    ]
                }
            }
        }
    ]
}

Should I instead amend the policy to deny access to everyone except the Cloudflare IP ranges (v4 and v6) plus my server IP? Or is there a more secure method that I haven't thought of?
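
In case it helps, here is roughly what I think that deny statement would look like, added alongside the existing Allow (an explicit Deny always overrides an Allow). The Cloudflare ranges shown are just the first few v4 and v6 entries from their published list at https://www.cloudflare.com/ips/ (the full list is longer), and 203.0.113.10 is a placeholder for my server IP:

{
    "Sid": "DenyAllExceptCloudflareAndOrigin",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::cdnxxxxxx/*",
    "Condition": {
        "NotIpAddress": {
            "aws:SourceIp": [
                "173.245.48.0/20",
                "103.21.244.0/22",
                "2400:cb00::/32",
                "203.0.113.10/32"
            ]
        }
    }
}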

Many thanks

  • "Should I rather amend the policy to deny access to all except the Cloudflare IP ranges (v4 and v6) as well as my server IP?" Yes, that is exactly what you should do. And once the bucket is locked down that way, you should make use of the features in Cloudflare, such as the Web Application Firewall, and Super Botfight Mode, to blocks those bots from scraping your static files. Although you will also need to start serving the files through Cloudflare, currently you are linking directly to the S3 bucket and bypassing Cloudflare. – Mark B Jun 21 '23 at 12:01
  • Thanks Mark - I agree with you, but that'll mean a HUGE amount of work and time finding and editing all the hrefs on the website. Its a typical issue of not thinking about the use case and security implications initially. Good news is CloudFlare is already configured with all your other suggestions! – CalvinR Jun 22 '23 at 15:52
  • There are several wordpress plugins out there that will do the work of finding the S3 URLs in your database, and changing them to a different URL, such as this: https://wordpress.org/plugins/better-search-replace/ Of course always backup your database before running something like that. Going forward I highly recommend using this plugin to handle copying your media assets to S3, and rewriting your URLs to use Cloudflare: https://deliciousbrains.com/wp-offload-media/ If you are using that plugin, then it is trivial to change to a different CDN or cloud storage solution later. – Mark B Jun 22 '23 at 16:29
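
For reference, a custom WAF rule along the lines Mark B suggests could look something like this. This is only a sketch: the field names follow Cloudflare's Rulesets API, the bot names are the same examples used in the bucket policy above, and the zone/ruleset setup is omitted:

{
    "description": "Block known scraper user agents",
    "action": "block",
    "expression": "(http.user_agent contains \"AhrefsBot\") or (http.user_agent contains \"Semrush\") or (http.user_agent contains \"Baiduspider\")"
}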

0 Answers